EXTENDING THE RANGE OF COMPUTATIONAL FIELDS OF
INTEGERS AND WIDTH OF SERIAL INPUT OPERANDS IN
MODULAR ARITHMETIC PUBLIC KEY CRYPTOGRAPHIC
COPROCESSORS DESIGNED FOR ELLIPTIC CURVE AND RSA TYPE
COMPUTATIONS



FIELD OF THE INVENTION 

The present invention relates to apparatus operative to accelerate cryptographic
co-processing peripherals and additionally but not exclusively to the use of such an
accelerated processing apparatus for polynomial based and prime number field
arithmetic, extending the range of computational fields of integers and width of
serial input operands in modular arithmetic public key cryptographic
coprocessors designed for elliptic curve and RSA type computations.

BACKGROUND OF THE INVENTION 

Security enhancements and performance accelerations for computational devices 
are described in Applicant's U.S. Patents 5,742,530, hereinafter "P1"; 5,513,133;
5,448,639; 5,261,001; and 5,206,824; published PCT patent application
PCT/IL98/00148 (WO98/50851) and corresponding U.S. Patent application
09/050,958, hereinafter "P2"; Onyszchuk et al's U.S. Patent 4,745,568; Omura et
al's U.S. Patent 4,587,627; and applicant's U.S. Patent Application 09/480,102;
the disclosures of which are hereby incorporated by reference. Applicant's U.S.
Patent 5,206,824 shows an early apparatus operative to implement polynomial 
based multiplication and squaring, which cannot perform operations in the prime 
number field, and is not designed for interleaving in polynomial based 
computations. An additional analysis is made of an approach to use the
extension field in polynomial based arithmetic in Paar, C., F. Fleischmann and
P. Soria-Rodriguez, "Fast Arithmetic for Public-Key Algorithms in Galois Fields
with Composite Exponents", IEEE Transactions on Computers, vol. 48, No. 10,
October 1999, henceforth "Paar". W. Wesley Peterson and E.J. Weldon Jr., in
the second edition of "Error-Correcting Codes", published by the MIT Press,
Cambridge, Mass., 1972, pages 174-179, demonstrated circuits for performing
division in the polynomial based residual number system GF(2^q), hereinafter
"Peterson". Peterson's circuit can only be used in a device where the multiplier
is exactly the length of the modulus. Typically, that would demand a device
more than twice as long as present devices, which would not be cost
effective for compact implementations. It could not be used in interleaved
implementations, and would not be useful where l is longer than 1, as Peterson
does not provide an anticipatory device for determining the Y0 of a multibit
character.

Whereas Knuth [D. Knuth, The Art of Computer Programming, vol. 2:
Seminumerical Algorithms, Addison-Wesley, Reading Mass., 1981], page 407,
implies that, using an ordinary division process on a single l bit character in
polynomial based division, we can assume a method to anticipate the next
character in the quotient, this invention discloses a method for anticipating the
next character of a quotient deterministically using a logic configuration.

SUMMARY OF THE INVENTION 
It is an aim of the present invention to provide a microelectronic specialized 
arithmetic unit operative to perform large number computations in the 
polynomial based and prime integer based number fields, using the same 
anticipating methods for simultaneously performing interleaved modular 
multiplication and reduction on varied radix multipliers. 

A further aim of the invention is to provide a compact microelectronic
specialized arithmetic logic unit, for performing modular and normal (natural,
non-negative field of integers) multiplication, division, addition, subtraction and
exponentiation over very large integers. When referring to modular
multiplication and squaring using both Montgomery methods and a reversed
format method for simplified polynomial based multiplication and squaring,
reference is made to the specific parts of the device as a superscalar modular
arithmetic coprocessor, or SMAP, or MAP or SuperMAP™, also as relates to
enhancements existing in the applicant's pending U.S. Patent application
09/050,958, filed March 31, 1998, and a continuation in part, 09/480,102, filed
on January 10, 2000.

Preferred embodiments of the invention described herein provide a modular 
computational operator for public key cryptographic applications on portable 
Smart Cards, typically identical in shape and size to the popular magnetic stripe 
credit and bank cards. Similar Smart Cards, as per applicant's technology of US
Patents 5,513,133 and 5,742,530 and applicant's above mentioned pending
applications, are being used in the new generation of public key cryptographic
devices for controlling access to computers, databases, and critical installations; 
to regulate and secure data flow in commercial, military and domestic 
transactions; to decrypt scrambled pay television programs, etc. and as terminal 
devices for similar applications. Typically, these devices are also incorporated in 
computer and fax terminals, door locks, vending machines, etc. 

The preferred architecture is of an apparatus operative to be integrated to a
multiplicity of microcontroller and digital signal processing devices, and also to 
reduced instruction set computational designs while the apparatus operates in 
parallel with the host's processing unit. 

This embodiment preferably uses only one multiplying device which 
inherently serves the function of two or three multiplying devices, basically 
similar to the architecture described in applicant's U.S. Patent 5,513,133 and further
enhanced in U.S. Patent application 09/050,958 and PCT application
PCT/IL98/00148. Using present conventional microelectronic technologies, the
apparatus of the present invention may be integrated with a controlling unit with 
memories onto a smart card microelectronic circuit. 

The main difference between hardware implementations in the polynomial 
based field, and in the prime number field, is that polynomial based additions 
and subtractions are simple XOR logic operations, without carry signals 
propagating from LS to MS. Consequently, there is no interaction between 
adjacent cells in the hardware implementation, and subtraction and addition are 
identical procedures. The earliest public notice of this that the authors are aware of was
a short lecture by Marco Bucci of the Fondazione Ugo Bordoni, at the Eurocrypt
Conference Rump Session in Perugia, Italy, in 1994, though, even then, this was well
known to all engineers practiced in the art.

Applicant's previous apparatus, described in P1 and P2, was typically prepared
to efficiently compute elliptic curve cryptographic protocols in the GF(p) field.
For use in the GF(2^q) field, in this invention, we show that as there is no
interaction between adjacent binary bits in the polynomial field, computations
can be processed efficiently, simultaneously performing reduction and
multiplication on a superscalar multiplication device without introducing
Montgomery functions and Montgomery parasites. Multiplication in GF(2^q) is
performed with the machine preferably starting from the most significant
partial products. Reduction is performed by adding as many moduli as are
necessary to reset MS ones to zeroes. As there is no carry out in these additions,
our results are automatically modularly reduced. In this invention polynomial
computations are performed using the same architecture, wherein, in GF(2^q), the
operands are fed in MS characters first, and all internal carry signals are
forced to zero. GF(p) computations are preferably executed as in P1 and P2,
wherein LS characters are processed first and MS characters last.
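
The MS-first multiplication with simultaneous reduction described above can be illustrated in software. The following is only a bit-level sketch under stated assumptions, not the hardware data flow of the invention: polynomials over GF(2) are held as Python integers, the function name gf2_mulmod_ms_first is hypothetical, and the multiplicand is assumed to be already reduced (of lower degree than the modulus).

```python
def gf2_mulmod_ms_first(a: int, b: int, n: int) -> int:
    """Multiply polynomials a and b over GF(2), reduced modulo the monic n.

    Illustrative sketch only: a is assumed to be of lower degree than n.
    The multiplier b is scanned from its most significant bit downward,
    and the modulus is XORed in whenever an MS one appears.
    """
    q = n.bit_length() - 1              # degree of the modulus
    acc = 0
    for i in range(b.bit_length() - 1, -1, -1):
        acc <<= 1                       # shift the running sum one place
        if (acc >> q) & 1:              # an MS one has appeared
            acc ^= n                    # add one modulus to reset it to zero
        if (b >> i) & 1:
            acc ^= a                    # carry-free addition of the multiplicand
    return acc
```

Because every addition is an XOR, no carry ever propagates and the running sum never grows past the length of the modulus. For the modulus 1011, for instance, the product of 011 and 101 evaluates to 100.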

The architecture has been extended to allow for a potentially faster
progression, in that serial multipliers are l bit wide characters, and at each clock
an l bit character is emitted from the carry save accumulator, the CSA. This has
somewhat complicated the anticipation process (Y0), in that for single bit wide
busses, an inversion of an odd number over a power of two base is also an odd
number, and the least significant bit of the J0 multiplicand was always a one. For
both number fields, however, the reduction process is identical, assuming the
switch out of (suppression of) carries, if we remember that our only aim is to
output a k character zero string, and we regard the Y0 function only as a zero
forcing vector.

The present invention also seeks to provide an architecture for a digital device
which is a peripheral to a conventional digital processor, with computational,
logical and architectural novel features relative to the processes described in US
Patent 5,513,133.

A concurrent process and a hardware architecture are provided, to perform 
modular exponentiation without division preferably with the same number of 
operations as are typically performed with a classic multiplication/division 
device, wherein a classic device typically performs both a large scale 
multiplication and a division on each operation. A particular feature of a 
preferred embodiment of the present invention is the concurrency of larger scale 
anticipatory zero forcing functions, the extension of number fields, and the 
ability to integrate this type of unit for safe communications.

The advantages realized by a preferred embodiment of this invention result from 
a synchronized sequence of serial processes. These processes are merged to 
simultaneously (in parallel) achieve three multiplication operations on n 
character operands, using one multiplexed k character serial/parallel multiplier in 
n effective clock cycles, where the left hand final k characters of the result reside 
in the output buffer of the multiplication device. This procedure accomplishes 
the equivalent of three multiplication computations in both fields, as described 
by Montgomery, for the prime number field and the equivalent of two 
multiplications and a division process in GF(2^q).

By synchronizing loading of operands into the SuperMAP and on the fly
detecting values of operands, and on the fly preloading and simultaneous
addition of next to be used operands, the apparatus is operative to execute
computations in a deterministic fashion. For all multiplications and
exponentiations, circuitry is preferably added which, on the fly, preloads three
first k character variables for a next iteration squaring sequence. A detection
device is preferably provided where only two of the three operands are chosen as
next iteration multiplicands, eliminating k effective clock cycle wait states.
Conditional branches are replaced with local detection and compensation devices,
thereby providing a basis for a simple control mechanism. The basic operations
herein described are typically executed in deterministic time in GF(p) using a
device described in US Patent 5,513,133 to Gressel et al or devices manufactured
by STMicroelectronics in Rousset, France, under the trade name ST19-CF58.

The apparatus of the present invention has particularly lean demands on external 
volatile memory for most operations, as operands are loaded into and stored in
the device for the total length of the operation. The apparatus preferably exploits 
the CPU onto which it is appended, to execute simple loads and unloads, and 
sequencing of commands to the apparatus, whilst the MAP performs its large 
number computations. Large numbers presently being implemented on smart 
card applications range from 128 bit to 2048 bit natural applications. The 
exponentiation processing time is virtually independent of the CPU which 
controls it. In practice, architectural changes are typically unnecessary when 
appending the apparatus to any CPU. The hardware device is self-contained, and 
is preferably appended to any CPU bus. 

In general, the present invention also relates to arithmetic processing of large 
integers. These large numbers are typically in the natural field of (non-negative) 
integers or in the Galois field of prime numbers, GF(p), composite prime 
moduli, and polynomial based numbers in GF(2^q). More specifically, preferred
embodiments of the present invention seek to provide devices that can 
implement modular arithmetic and exponentiation of large numbers. Such 
devices are suitable for performing the operations of Public Key Cryptographic 
authentication and encryption protocols, which, in the prime number field work 
over increasingly large operands and which cannot be executed efficiently with 
present generation modular arithmetic coprocessors, and cannot be executed 
securely in software implementations. Preferably, the same general architecture 
is used in elliptic curve implementations, on integers which are orders of 
magnitude smaller. Using the novel reverse mode method of multiplication,
polynomial arithmetic is advantageous as generating zeroes does not encumber
computations with the parasitic 2^-n factor.

The architecture offers a modular implementation of large operand integer
arithmetic, while allowing for normal and smaller operand arithmetic enabled by 
widening the serial single character bus, i.e., use of a larger radix. Typically, this 
is useful for accelerating computations, for reducing silicon area for 
implementing the SuperMAP, and for generating a device of length compatible 
with popular Digital Signal Processors (DSP). 



For modular multiplication in the prime and composite field of odd numbers, A
and B are defined as the multiplicand and the multiplier, respectively, and N is
defined as the modulus in modular arithmetic. N is typically larger than A or B.
N also denotes the composite register where the value of the modulus is stored.
N is, in some instances, smaller than A. A, B, and N are typically n
characters long, where characters are typically one to 8 bits long, and k is the
number of l bit characters in the size of the group defined by the size (number of
cells) of the multiplying device. Similarly, in polynomial based GF(2^q)
computations, the modulus, N, is n bits long wherein the MS bit is a one (a
monic), and the A, S and B operands are also, when properly reduced, n bits long.
If a result of a GF(2^q) computation is monic it is preferably "reduced" to a
value with an MS zero, by XORing said result value with the modulus. In a
preferred embodiment, as the first significant bit of a GF(2^q) result is formed
in the reverse mode, the MAP can sense if the bit is a one, and perform the
preferred reduction.

In the prime field, ≡, or in some instances =, is used to denote congruence of
modular numbers, for example 16 ≡ 2 mod 7. 16 is termed "congruent" to 2
modulo 7 as 2 is the remainder when 16 is divided by 7. When Y mod N ≡ X
mod N, both Y and X may be larger than N; however, for positive X and Y, the
remainders are identical. Note also that the congruence of a negative integer Y is
Y + u·N, where N is the modulus, and if the congruence of Y is to be less than N,
u is the smallest integer which gives a positive result.

In GF(2^q) congruence is much simpler, as addition and subtraction are identical,
and normal computations typically do not leave a substantial overflow. For
N = 1101 and A = 1001, as the left hand MS bit of A is 1, we must reduce
("subtract") N from A by using modulo 2 arithmetic, A XOR N = 1001 XOR
1101 = 0100.

The Yen symbol, ¥, is used to denote congruence in a limited sense, especially
useful in GF(p). During the processes described herein, a value is often either
the desired value, or equal to the desired value plus the modulus. For example,
X ¥ 2 mod 7: X can be equal to 2 or 9. X is defined to have limited congruence to
2 mod 7. When the Yen symbol is used as a superscript, as in B^¥, then
0 ≤ B^¥ < 2N, or stated differently, B^¥ is either equal to the smallest positive
B which is congruent to B^¥, or is equal to the smallest positive congruent B plus
N, the modulus. Other symbols, specific to this invention, appear later in this
summary.
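
Limited congruence can be resolved to the ordinary remainder with at most one subtraction of the modulus. The following is a minimal sketch, assuming the value is already known to lie below twice the modulus; the function name resolve_limited_congruence is hypothetical and does not appear in the specification.

```python
def resolve_limited_congruence(x: int, n: int) -> int:
    """Return the residue in [0, n) of a value x assumed to lie in [0, 2n).

    Illustrative only: a value with limited congruence is either the desired
    residue or that residue plus one modulus, so one conditional subtraction
    is sufficient.
    """
    return x - n if x >= n else x
```

With the example from the text, both 2 and 9 resolve to 2 for the modulus 7.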

When X = A mod N, X is defined as the remainder of A divided by N;
e.g., 3 = 45 mod 7, and, much simpler, in GF(2^q), 1111 mod 1001 = 0110.
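
The two remainder examples can be checked with a few lines of code. This is a sketch under the assumption that GF(2) polynomials are represented as Python integers, one bit per coefficient; gf2_mod is a hypothetical helper name, and the loop is plain polynomial long division rather than the anticipatory circuit of the invention.

```python
def gf2_mod(a: int, n: int) -> int:
    """Remainder of polynomial a divided by polynomial n over GF(2).

    Illustrative sketch: each step XORs a shifted copy of the modulus into
    the dividend to clear its current most significant one.
    """
    deg_n = n.bit_length() - 1
    while a.bit_length() - 1 >= deg_n:
        a ^= n << (a.bit_length() - 1 - deg_n)
    return a


# Integer remainder from the text, using the built-in operator.
integer_example = 45 % 7
# Polynomial remainder from the text.
polynomial_example = gf2_mod(0b1111, 0b1001)
```

Subtraction and addition are the same XOR here, so no borrow or carry ever has to be resolved.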

In number theory, the modular multiplicative inverse of X is written as X^-1,
which is defined by X·X^-1 mod N = 1. If X = 3 and N = 13, then X^-1 = 9, i.e., the
remainder of 3·9 divided by 13 is 1 in GF(p).

For both number fields, we typically choose to compute the multiplicative
inverse of A using the exponential function, e.g., A^-1 mod q = A^(q-2) mod q.
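
For the prime number field, inversion by exponentiation can be sketched with Python's built-in three-argument pow. The function name inverse_by_exponentiation is an assumption for illustration, and the modulus is assumed to be prime, as the formula requires.

```python
def inverse_by_exponentiation(a: int, p: int) -> int:
    """Multiplicative inverse of a modulo a prime p, computed as a**(p-2) mod p.

    Illustrative sketch: valid only when p is prime and a is not a multiple of p.
    """
    return pow(a, p - 2, p)


# The example from the text: the inverse of 3 modulo 13.
x_inv = inverse_by_exponentiation(3, 13)
```

Using exponentiation for inversion lets the same modular multiplier serve both operations, with no separate division hardware.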

The acronyms MS and LS are used to signify "most significant" and "least 
significant", respectively, when referencing bits, characters, and full operand 
values, as is conventional in digital nomenclature, but in the reversed mode 
polynomial base, operands are loaded MS data first and LS last, wherein the bit 
order of the data word is reversed when loaded. 

Throughout this specification N designates both the value N, and the name of
the shift register which stores N. An asterisk superscript on a value denotes that
the value, as it stands, is potentially incomplete or subject to change. A is the
value of the number which is to be exponentiated, and n is the bit length of the N
operand. After initialization, when A is "Montgomery P field normalized" to A*
(A* = 2^n·A, explained in P1), A* and N are typically constant values throughout
the intermediate steps in the exponentiation. In GF(2^q) computations where
computations might be performed with the normal unreversed positioning of bits
we would be bound by this same protocol. However, using the reversed format,
our computations generate most significant zeroes, which are disregarded, and
do not represent a multiplication shift, as there is no carry out.



During the first iteration, after initialization of an exponentiation, B is equal to 
A*. B is also the name of the register wherein the accumulated value that finally 
equals the desired result of exponentiation resides. S or S* designates a
temporary value; and S designates the register or registers in which all but the
single MS bit of a number S in GF(p) is stored. (S* concatenated with this MS
bit is identical to S.) S(i-1) denotes the value of S at the outset of the i'th
iteration. In these polynomial computations there is no need to perform modular 
reduction on S. 

Typically, Montgomery multiplication of X and Y in the prime number field is
actually an execution of (X·Y·2^-n) mod N, where n is typically the number of
characters in a modulus. This is written P(A·B)N, and denotes MM or
multiplication in the P field. In the context of Montgomery mathematics, we
refer to multiplication and squaring in the P field and in the polynomial based
field as multiplication and squaring operations.
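
The Montgomery product (X·Y·2^-n) mod N can be illustrated with the textbook bit-serial procedure, in which a zero-forcing bit decides at every step whether the modulus is added. This is a sketch under stated assumptions, not the circuit of the invention: single bit characters (l = 1), an odd modulus, operands smaller than the modulus, and hypothetical names montgomery_multiply and nbits.

```python
def montgomery_multiply(x: int, y: int, n: int, nbits: int) -> int:
    """Return x*y*2**(-nbits) mod n for odd n, assuming x, y < n < 2**nbits.

    Illustrative sketch with one bit characters: at each step the modulus is
    added whenever the least significant bit of the sum is a one, so the bit
    shifted out is always a zero and no information is lost.
    """
    s = 0
    for i in range(nbits):
        s += ((y >> i) & 1) * x     # add the multiplicand if the multiplier bit is one
        y0 = s & 1                  # zero-forcing bit
        s += y0 * n                 # force the least significant bit to zero
        s >>= 1                     # discard the emitted zero
    return s - n if s >= n else s   # resolve the limited congruence
```

The final conditional subtraction is needed because the loop only guarantees a result below twice the modulus. The 2^-n factor in the result is the Montgomery parasite referred to in the text.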

We will redefine this innovative extension of a Montgomery type arithmetic in
the GF(2^q) to mean a reversed format data order, wherein MS zero forcing does
not change congruence, or initiate a burdensome parasitic factor. We may thus
introduce a new set of symbols to accommodate the arithmetic extension, and to
enable an architecture with wider serial multiplier buses. Such a more than one
bit serial multiplier stream is preferable for enabling a natural integer superscalar
multiplier; such a multiplier device may accept 32 bit multiplicands and 4 bit
multipliers, in an apparatus that can perform modular arithmetic
multiplications and reductions simultaneously.

Symbols in the Serial/Parallel Super Scalar Modular Multiplier Enhancement

l     Number of bits in a character (digit).

r     Radix of multiplier character, r = 2^l.

n     Size of operands (multiplier, multiplicand and modulus) in characters. In the
      demonstration of the computations in the GF(p) field of Montgomery
      arithmetic, l is equal to one, and n is the bit length of the modulus operand.

k     Length of serial-parallel multiplier in characters.

m     Number of interleaved slices (segments) of multiplicand: m = n/k.

Si    Partial product result of i'th MM iteration; 0 ≤ i ≤ m - 1; S0 = 0.

Si0   Right hand character of the i'th iteration result, after disregarding the first k
      character right hand zeroes of Z.

S[    The left hand n - k characters of the i'th result.

Sij   j'th character of Si.

A     Parallel multiplicand, consists of m·k characters.

Ai    The i'th k character slice of A (and/or register storing Ai).

Ait   The t'th character of Ai.

B     A serial multiplier (and/or register store of B).
B0    First right hand k characters of B.
B     Last left hand (n - k) characters of B.
B0j   j'th character of B0.
Bj    j'th character of B.

N     Modulus operand (and/or register storing said modulus).
N0    The Right Hand k characters of N
      {the LS characters in GF(p); MS characters in GF(2^q)}.
N     (n - k) Left Hand characters of N
      {MS characters in GF(p); LS characters in GF(2^q)}.
N0j   j'th character of N0.
Nj    j'th character of N.

Y0    Zero forcing variable required for both Montgomery multiplication and
      reduction in GF(p). Y0 is k characters long.
Y0j   j'th character of Y0.
R     A summation of the value residing in the Carry Save Accumulator (includes
      unresolved internal carry ins) and the carry out bit from the final serial
      summator, 460.

J00   Zero forcing character function of the modulus, N, for "on the fly" finite field
      multiplication and reduction. For l = 1, J00 is always equal to 1.

Carryj  j'th internal carry character of radix r serial-parallel multiplier.

Carry   A radix r carry of output serial ...

Sumj  j'th internal sum character of radix r serial-parallel multiplier.

LS    Least Significant.
MS    Most Significant.

||    Concatenation, e.g. A = 110, B = 1101; A || B = 1101101.

Right Hand  A Least Significant portion of all GF(p) computation data blocks and
      an MS portion of the reversed GF(2^q) format.
Left Hand   A Most Significant portion of all GF(p) computation data blocks and
      an LS portion of the reversed GF(2^q) format.

GF(p) Galois Field, strictly speaking finite fields over prime numbers where we
      also use composite integers (product of two very large prime numbers) that
      allow for addition, subtraction, multiplication and pseudo-division.
GF(2^q)  Galois Fields using modulo 2 arithmetic.

⊕     An operator or device which may be externally switched to add or subtract
      integers with or without carries, as befits the specific number system.

⊗     An operator or device which may be switched to either execute
      multiplication over GF(p) or multiplication over GF(2^q).

if    The number field switch, where if:
      if = 1, the switch is operative to enable all carry in/outputs for GF(p)
      computations;
      if = 0, the switch is operative to disable all carry in/outputs for GF(2^q)
      computations.

SuperMAP  Any one of a member of the proprietary Superscalar Modular
      Arithmetic Processor family that is the subject of this invention. The
      SuperMAP is registered in Europe and is pending in the United States.
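
The effect of the number field switch on the switchable add operator can be modelled with a ripple adder whose carries are gated. This is only a behavioural sketch; the names switched_add, width and carry_enable are assumptions made for illustration and do not describe the actual cell design.

```python
def switched_add(a: int, b: int, width: int, carry_enable: int) -> int:
    """Add two width-bit values with every carry gated by the field switch.

    Illustrative sketch: carry_enable = 1 gives ordinary integer addition
    (GF(p) mode); carry_enable = 0 suppresses all carries, leaving a bitwise
    XOR (GF(2^q) mode).
    """
    result, carry = 0, 0
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        result |= (x ^ y ^ carry) << i
        carry = carry_enable & ((x & y) | (carry & (x ^ y)))
    return result | (carry << width)    # final carry out, zero when gated off
```

With the switch at zero, adding 1001 and 1101 yields 0100, the XOR result used in the congruence example earlier in this summary; with the switch at one the same inputs yield the ordinary sum 10110.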

According to a first aspect of the present invention, there is therefore provided
a microelectronic apparatus for performing ⊗ multiplication and squaring in
both polynomial based GF(2^q) and GF(p) field arithmetic, squaring and
reduction using a serial fed radix 2^l multiplier, B, with k character multiplicand
segments, Ai, and a k character ⊕ accumulator wherein reduction to a limited
congruence is performed "on the fly", in a systolic manner, with Ai, a
multiplicand, times B, a multiplier, over a modulus, N, and a result being at most
2n + 1 characters long, including the k first emitting disregarded zero characters,
which are not saved, where k characters have no less bits than the modulus,
wherein said operations are carried out in two phases, the apparatus comprising:

a first (B), and second (N) main memory register means, each register operative to
hold at least n bit long operands, respectively operative to store a multiplier value
designated B, and a modulus, denoted N, wherein the modulus is smaller than 2^n;

a digital logic sensing detector, Y0, operative to anticipate "on the fly" when a
modulus value is to be ⊕ added to the value in the ⊕ adder accumulator device such
that all first k characters emitting from the device are forced to zero;

a modular multiplying device for at least k character input multiplicands, with
only one, at least k characters long ⊕ adder, ⊕ summation device operative to
accept k character multiplicands, the ⊗ multiplication device operative to switch
into the ⊕ accumulator device, in turn, multiplicand values, and in turn to
receive multiplier values from a B register, and an "on the fly" simultaneously
generated anticipated value as a multiplier which is operative to force k first
emitting zero output characters in the first phase, wherein at each effective
machine cycle at least one designated multiplicand is ⊕ added into the ⊕
accumulation device;

the multiplicand values to be switched in turn into the ⊕ accumulation device
consisting of one or two of the following three multiplicands, the first
multiplicand being an all-zero string value, a second value, being the
multiplicand Ai, and a third value, the N0 segment of the modulus;






an anticipator to anticipate the l bit k character serial input Y0 multiplier values;

the apparatus being operable to input in turn multiplier values into the
multiplying device, said values being first the B operand in the first phase, and
concurrently, the second multiplier value consisting of the Y0 "on the fly"
anticipated k character string, to force first emitting zeroes in the output;

the apparatus further comprising an accumulation device, ⊕, operative to
output values simultaneously as multiplicands are ⊕ added into the ⊕ accumulation
device;

an output transfer mechanism, in the second phase operative to output a final
modular ⊗ multiplication result from the ⊕ accumulation device.

According to a preferred embodiment, ⊕ summations into the ⊕ accumulation
device are activated by each new serially loaded higher order multiplier
character.

Preferably, the multiplier characters are operative to cause no ⊕ summation
into the ⊕ accumulation device if both the input B character and the
corresponding input Y0 character are zeroes;

are operative to ⊕ add in only the Ai multiplicand if the input B character is a
one and the corresponding Y0 character is a zero;

are operative to ⊕ add in only the N modulus, if the B character is a zero, and
the corresponding Y0 character is a one; and

are operative to ⊕ add in the ⊕ summation of the modulus, N, with the
multiplicand Ai, if both the B input character and the corresponding Y0 character
are ones.

Preferably, the apparatus is operative to preload multiplicand values Ai and N
into two designated preload buffers, and to ⊕ summate these values into a third
multiplicand preload buffer, obviating the necessity of ⊕ adding in each
multiplicand value separately.
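
The four selection cases and the preloaded sum can be sketched for the single bit, GF(p) case. The code below is an illustration under stated assumptions (l = 1, odd modulus, operands smaller than the modulus); the function name montgomery_with_preload and the table layout are hypothetical and are not taken from the specification.

```python
def montgomery_with_preload(a: int, b: int, n: int, nbits: int) -> int:
    """Return a*b*2**(-nbits) mod n using one table lookup and one add per step.

    Illustrative sketch: the table holds zero, the multiplicand, the modulus
    and their preloaded sum, indexed by the B bit and the anticipated Y0 bit.
    """
    table = (0, a, n, a + n)                 # index = b_bit + 2 * y0_bit
    s = 0
    for i in range(nbits):
        b_bit = (b >> i) & 1
        # Y0 is anticipated before the addition takes place; for odd n the
        # least significant bit of the sum is zero exactly when this bit is added.
        y0_bit = (s ^ (b_bit & a)) & 1
        s = (s + table[b_bit + 2 * y0_bit]) >> 1
    return s - n if s >= n else s
```

Preloading the sum of the multiplicand and the modulus means that each step performs a single addition regardless of which of the four cases applies, which is what allows a fixed number of clock cycles per multiplication.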

Preferably, the multiplier character values are arranged for input in serial
single character form, and wherein the Y0 detect device is operative to anticipate
only one character in a clocked turn.

In a preferred embodiment the ⊕ accumulation device performs
modulo 2 computations, XOR addition/subtraction, wherein all carry bits in
addition and subtraction components are disregarded, thereby precluding
provisions for overflow and further limiting convergence in computations.

Preferably, carry inputs are disabled to zero, denoted if = 0, typically operative
to perform polynomial based multiplication.

Preferably the apparatus is operative to provide non-carry arithmetic by
omitting carry circuitry, such that, for an if equal to zero acting on an element in
a circuit equation computing in GF(2^q), if designates omitted circuitry, and all
adders and subtractors, designated ⊕, reduce to XOR, modulo 2
addition/subtraction, elements.

A preferred embodiment is adapted such that the first k character segments
emitted from the operational unit are zeroes, being controlled by the following
four quantities in anticipating the next in turn Y0 character:

i the l bit S out bits of the result of the l bit by l bit mod 2^l ⊗ multiplication
of the right-hand character of the Ai register times the Bd character of the B
stream, A0 ⊗ Bd mod 2^l;

ii the first emitting carry out character from the ⊕ accumulation device;

iii the l bit S out character from the second from the right character emitting
cell of the ⊕ accumulation device, SO1;

iv the l bit J0 value, which is the negative multiplicative inverse of the
right-hand character in the N0 modulus multiplicand register,

wherein values A0 ⊗ Bd mod 2^l, if(CO0), and SO1 are ⊕ added character to
character together and "on the fly" multiplied by the J0 character to output a
valid Y0 zero-forcing anticipatory character to force an l bit egressing string of
zeroes.
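
The role of J0 as the negative multiplicative inverse of the right-hand modulus character can be illustrated for GF(p) with a radix 2^l Montgomery loop. This sketch folds the separate carry out and sum characters named above into one Python integer, so it shows only the arithmetic relation, not the cell-level anticipation; the names j0_constant, montgomery_radix and chars are assumptions, and pow with a negative exponent requires Python 3.8 or later.

```python
def j0_constant(n0: int, l: int) -> int:
    """Negative multiplicative inverse of the odd character n0 modulo 2**l."""
    r = 1 << l
    return (-pow(n0, -1, r)) % r


def montgomery_radix(a: int, b: int, n: int, l: int, chars: int) -> int:
    """Return a*b*2**(-l*chars) mod n, assuming n odd and a, b < n < 2**(l*chars).

    Illustrative sketch: one l bit multiplier character is consumed per step,
    and the zero-forcing character y0 makes the emitted character all zeroes.
    """
    r = 1 << l
    j0 = j0_constant(n % r, l)
    s = 0
    for d in range(chars):
        b_d = (b >> (l * d)) & (r - 1)          # next multiplier character
        y0 = (((s + a * b_d) % r) * j0) % r     # zero-forcing character
        s = (s + a * b_d + y0 * n) >> l         # emitted character is zero, drop it
    return s - n if s >= n else s
```

For single bit characters the right-hand character of an odd modulus is 1, so j0_constant(1, 1) evaluates to 1 and the multiplication by J0 disappears.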

The apparatus is preferably operable to perform multiplication on
polynomial based operands in a reverse mode, multiplying from right
hand MS characters to left hand LS characters, operative to perform modular
reduced ⊗ multiplication without Montgomery type parasitic functions.

Preferably, the apparatus further comprises preload buffers which are serially
fed and where multiplicand values are preloaded into the preload buffers on the
fly from one or more memory devices.

The apparatus is preferably operative to ⊕ summate into a multiplication
stream a previous value, emitting from an additional n bit S register, via an l bit
⊕ adder circuit such that first emitting output characters are zeroes when the Y0
detector is operative to detect the necessity of ⊕ adding moduli to the ⊕
summation in the ⊕ accumulation device, wherein the Y0 detector operates to
detect utilizing the next in turn ⊕ added characters A0 ⊗ Bd mod 2^l, if(CO0),
SO1, Sd and if(COz), the composite of ⊕ added characters to be finite field ⊗
multiplied on the fly by the l bit J0 value, where ⊕ defines the addition and ⊗
defines the multiplication as befits the finite field used in the process.

Preferably, for l = 1, J0 is implicitly 1, and the J0 ⊗ multiplication is implicit,
without additional hardware.

Preferably, a comparator is operative to sense a finite field output from the ⊗
modular multiplication device, whilst working in GF(p), where the first right
hand emitting k zero characters are disregarded, where the output is larger than
the modulus, N, thereby operative to control a modular reduction whence said
value is output from the memory register to which the output stream from the
multiplier device is destined, and thereby precluding allotting a second memory
storage device for the smaller product values.

Preferably, for ⊗ modular multiplication in the GF(2^q), the apparatus is
operative to multiply without an externally precomputed more than l bit
zero-forcing factor.

A preferred embodiment is operative to compute a J0 constant by resetting
either the A operand value or the B operand value to zero and setting the partial
result value, S0, to 1.

According to a second aspect of the present invention there is provided a 
microelectronic apparatus for performing interleaved finite field modular 
multiplication of integers A and B, so as to generate an output stream of A times 
B modulus N, wherein a number of characters in a modulus operand register 
is larger than a segment length of k characters, wherein the ⊗ multiplication 
process is performed in a plurality of interleaved iterations, wherein at each 
interleaved iteration with operands input into a ⊗ multiplying device, consisting 
of N, the modulus, B, a multiplier, a previously computed partial result, S, and a 
k character string segment of A, a multiplicand, the segments progressing from 
the A_0 string segment to the A_{m-1} string segment, wherein each iterative result is 
⊕ summated into a next in turn S, temporary result, in turn, wherein first 
emitting characters of iterative results are zeroes, the apparatus comprising: 

first (B), second (S) and third (N) main memory registers, each register respectively 
operative to store a multiplier value, a partial result value and a modulus; 

a modular multiplying device operative to ⊕ summate into the ⊕ accumulation 
device, in turn one or two of a plurality of multiplicand values, during each one 
of a plurality of phases of the iterative ⊗ multiplication process, and in turn to 
receive as multipliers, in turn, inputs from: 

said B register, 

an "on the fly" anticipating value, Y_0, being usable as a multiplier to force first 
emitting right-hand zero output characters in each iteration, and 

said N register; 

the multiplicand parallel registers operative at least to receive in turn, values 
from the A, B, and N register sources, and in turn, also a multiplicand zero 
forcing Y_0 value; 

the apparatus further utilizing the Y_0 detect device operative to generate a 
binary string operative to be a multiplier during the first phase and operative to 
be a multiplicand in the second phase; 

the apparatus being operable to obtain multiplicand values, suitable for 
switching into the ⊕ accumulation device for the first phase consisting of a first 
zero value, a second value, A_i, which is a k character string segment of a 
multiplicand, A, and a third value N_0, being the first emitting k characters of the 
modulus, N; 

the apparatus further being operable to utilize a temporary result value, S, 
resulting from a previous iteration, operative to be ⊕ summated with the value 
emanating from the ⊕ accumulation device, to generate a partial result for the 
next-in-turn iteration; 

the apparatus further being operable to utilize multiplicand values to be input, 
in turn, into the ⊕ accumulation device for a second multiplication phase, 
comprising firstly a first zero value, secondly an A_i operand, remaining in place 
from the first phase, and thirdly a Y_0 value having been anticipated in the first 
phase; 

multiplier values input into the multiplying device in the first phase being 
firstly an emitting string, B_0, said multiplication device being operable to 
multiply said string segment concurrently ⊗ multiplying with a second ⊗ 
multiplier value consisting of the anticipated Y_0 string which is simultaneously 
loaded character by character as it is generated into a preload multiplicand buffer 
for the second phase; 

the two multiplier values operative to be input into the apparatus during a 
second phase being left hand n - k character values from the B operand, 
designated B̲, and the left hand n - k characters of the N modulus, designated N̲, 
respectively; and 

wherein said apparatus further comprises a multiplying flush out device 
operative in a last phase to transfer the left hand segment of a result value 
remaining in the ⊕ accumulation device into a result register. 

Preferably, the apparatus is operable to perform ⊗ multiplication on 
polynomial based operands in a reverse mode, multiplying from 
MS characters to LS characters, operative to perform modular reduction without 
the Montgomery type parasitic functions, as in applicant's patent U.S. 5,742,530. 

According to a third aspect of the present invention there is provided apparatus 
operative to anticipate a Y_0 value using first emitting values of the multiplicand, 
and present inputs of the B multiplier, carry out values from the ⊕ accumulation 
device, ⊕ summation values from the ⊕ accumulation device, the present values 
from the previously computed partial result, and carry out values from the 
⊕ adder which ⊕ summates the result from the ⊕ accumulation device with the 
previous partial result. 

Preferably, the apparatus is adapted to ensure that k first emitting values from 
the device are zero characters, said adaptation comprising anticipation of a next 
in turn Y_0 character using the following quantities: 

i the l bit S_out bits of the result of the l bit by l bit mod 2^l ⊗ multiplication 
of the right-hand character of the A_i register times the B_d character of the B 
stream, A_0 ⊗ B_d mod 2^l; 

ii the first emitting carry out character from the ⊕ accumulation device; 

iii the l bit S_out character from the second from the right-hand character 
emitting cell of the ⊕ accumulation device, SO_1; 

iv the next in turn character value from the S stream, S_d; 

v the l bit carry out character from the Z output full adder, y(CO_Z); 

vi the l bit J_0 value, which is the negative multiplicative inverse of the 
right-hand character in the N_0 modulus multiplicand register; 

wherein values A_0 ⊗ B_d mod 2^l, y(CO_0), SO_1, S_d are ⊕ added character to 
character together and "on the fly" ⊗ multiplied by the J_0 character to output a 
valid Y_0 zero-forcing anticipatory character. 

In a further embodiment there is also provided at least one sensor operative to 
compare the output result to N, the modulus, the mechanism operative to actuate 
a second subtracter on the output of the result register, thereby to output a 
modular reduced value which is limited congruent to the output result value, 
precluding the necessity to allot a second memory storage for a smaller result. 

In a yet further embodiment, a value which is a ⊕ summation of two 
multiplicands is loaded into a preload character buffer with at least a k characters 
memory means register concurrently whilst one of the first values is loaded into 
another preload buffer. 

According to a fourth aspect of the present invention there is provided 
apparatus with one ⊕ accumulation device, and an anticipating zero forcing 
mechanism operative to perform a series of interleaved ⊗ modular 
multiplications and squarings and being adapted to perform concurrently the 
equivalent of three natural integer multiplication operations, such that a result is 
an exponentiation. 

In an embodiment, next-in-turn used multiplicands are preloaded into a 
preload register buffer on the fly. 

In a further embodiment, apparatus buffers and registers are operative to be 
loaded with values from external memory sources and said buffers and registers 
are operative to be unloaded into the external memory source during 
computations, such that the maximum size of the operands is dependent on 
available memory means. 

In a yet further embodiment there is also provided memory register means, 
said memory means being typically serial single character in/serial single 
character out, parallel at least k characters in/parallel at least k characters out, 
serial single character in/parallel at least k characters out, and parallel k 
characters in/serial single character out. 

Preferably the apparatus is operable to provide, during a final phase of a 
multiplication type iteration, multiplier inputs which are zero characters 
operative to flush out the left hand segment of the carry save ⊕ accumulator 
memory. 

Preferably, the apparatus is operable to preload next in turn multiplicands into 
preload memory buffers on the fly, prior to their being required in an iteration. 

Preferably, the apparatus is operable to preload multiplicand values into 
preload buffers on the fly from central storage memory means. 

The same device is operable to compute the k character Montgomery constant 
J_0, related to the right hand k character segment of a modulus, preferably by 
resetting both A and B to zero and setting S_0 = 1, whilst subsequently performing 
a k bit multiplication. The result will reside in the Y_0 register. 
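
By way of illustration only, the following Python sketch models arithmetically why running the zero-forcing first phase with A = B = 0 and S_0 = 1 leaves the Montgomery constant in the Y_0 register. It is a behavioural model in GF(p), not the disclosed circuit; the function name, the argument names and the least-significant-character-first ordering are assumptions made for the example, and `pow(x, -1, m)` requires Python 3.8 or later.

```python
def j0_via_zero_forcing(n0, k, l):
    """Model the first phase with A = B = 0 and S_0 = 1.

    n0 : right-hand k character segment of an odd modulus (as an integer)
    k  : number of characters in a segment
    l  : number of bits in a character
    Returns the k character value accumulated in the (hypothetical) Y_0 register.
    """
    char = 1 << l
    # One-character constant: negative inverse of the right-hand character.
    j00 = (-pow(n0 % char, -1, char)) % char
    acc = 1          # partial result preset to 1, operands A and B are zero
    y0 = 0
    for j in range(k):
        # Character that forces the next emitted character to zero.
        y_j = (j00 * (acc % char)) % char
        # Add y_j copies of the modulus segment; the low character is now zero
        # and is shifted out.
        acc = (acc + y_j * n0) >> l
        y0 |= y_j << (j * l)
    return y0


if __name__ == "__main__":
    n0 = 0xB7E1                      # arbitrary odd 16-bit segment
    y0 = j0_via_zero_forcing(n0, k=4, l=4)
    print(hex(y0), (y0 * n0 + 1) % (1 << 16) == 0)
```

The value returned satisfies Y_0·N_0 ≡ -1 mod 2^(k·l), which is the defining property of the k character constant.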

Modular Multiplication Sequences using Montgomery type Arithmetic 

The k character carry save adder, the CSA, is the basis for the serial/parallel 
superscalar modular multiplication in both the polynomial field and in the prime 
number field. Polynomial GF(2^q) based computations are executed preferably 
with all carry mechanisms switched off. 
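
The effect of switching the carry mechanisms off can be sketched as follows. This is a generic carry save addition step written in Python for illustration, not a description of the CSA cells of the apparatus; the function and flag names are assumed for the example.

```python
def csa_step(a, b, c, carries_enabled=True):
    """Carry save addition of three bit vectors held as Python integers.

    With carries enabled (GF(p) style), the returned pair satisfies
    partial_sum + carry == a + b + c.
    With carries disabled (GF(2^q) style), addition degenerates to XOR.
    """
    partial_sum = a ^ b ^ c
    if not carries_enabled:
        return partial_sum, 0
    # Majority function of each bit position, weighted one position higher.
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return partial_sum, carry


if __name__ == "__main__":
    s, c = csa_step(13, 7, 9)
    print(s + c)                       # integer sum of the three inputs
    s2, c2 = csa_step(13, 7, 9, carries_enabled=False)
    print(s2, c2)                      # XOR of the three inputs, no carry
```

The same data path therefore serves both number systems; only the carry term differs.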

The Serial - Parallel Super Scalar Montgomery Multiplier computes a 
Montgomery modular product in three phases, wherein in one preferred 
embodiment, the last phase may be a single clock dump of the whole left hand k 
character segment with a carry of the CSA (MS for normal multiplications, LS 
for reverse mode polynomial computations), and in a more compact 
embodiment, the last phase may be a k effective clock serial flush out of the 
contents of the CSA. 

In previous P2 disclosures, the Y_0 factor was computed bit by bit; 
consequently, only the right hand bit of J_0 was consequential, and by definition a 
one, and a function of the right hand bit of the modulus. In this enhancement 
device, the device is character serial, and an l bit character Y_0 is generated at 
each clock cycle. As in previous P1 disclosures, Y_0 is the first phase zero forcing 
function which adds in the value of modulus a necessary number of times into 
the accumulated result, so that the relevant answer is congruent and never longer 
than mk+1 characters, and the first right hand emitted characters were all zeroes: 
X + QN ≡ X, as QN ≡ 0 mod N. 

Modular Multiplication Sequences 

Prior to initiating computation, we assume that there are no previous temporary 
or random values in the device; that the operands N, B, and at least the first 
segment value of A are available in the registers of the device. S_0 = 0; the partial 
product at initiation is typically zero. Typically, modular arithmetic is executed 
on operands consisting of two or more k character segments, typically in three 
distinct phases. For a normal full multiplication, where there are m segments of 
the modulus, there are typically m superscalar multiplication interleaved 
iterations whence each segment of the multiplicand is multiplied by the total 
multiplier, typically, B. 

The process of the first phase, on the (i = 0'th segment) of each interleaved 
superscalar multiplication is the generic superscalar multiplication accumulation 
interaction: 

S_i ⊕ A_i ⊗ B_0 ⊕ Y_0 ⊗ N_0 

(B_0 and Y_0 are serially character by character fed from the first segment of the 
operand into the multiplier, A_i and N_0 are parallel single slice operands, S_i is the 
partial product from a previous iteration/computation. S_i = 0 on the 0'th first 
iteration.) 

The first phase process implements a ⊕ summation of the two superscalar 
products with the right hand segment of the previous result. A k character string 
of zeroes has emitted from the multiplying device, and is disregarded; a partial 
first segment result resides in the device buffer, which is summated into the 
second phase result. 

The first phase result consists typically of R, the contents of the CSA 
concatenated with the serially output right hand segment of all zeroes. (In GF(p) 
computations there is an additional LS carry-in bit to R.) 

The process of the second phase is the generic superscalar multiplication 
accumulation interaction: 

R ⊕ S_i ⊕ A_i ⊗ B̲ ⊕ Y_0 ⊗ N̲ 

(Remember, an underlined variable, e.g., B̲, is the remaining left hand value 
of an operand. It is typically one or more segments, i.e., m - 1 segments. B̲ and 
N̲ are serially fed character by character into the multiplier, and both A_i, 
remaining from the first phase, and Y_0, which was a multiplier in the first phase 
and was loaded into the machine in the first phase to be a multiplicand in 
subsequent iterations, are parallel operands). 

At the end of the second phase, typically consisting of m-1 iterations, the left 
hand segment of S_i remains in the CSA, ready to be transferred, and the right 
hand slice (k character segment) has emanated from the device, typically into an 
S register. Note that multiplication in the prime number field has been performed 
in a conventional carry save summation method. Multiplication in the GF(2^q) 
reversed format mode has progressed from most to least significant characters. 
The Y_0 function has anticipated when a modulus value must be "added" into the 
accumulator. Except for the disabled carry bits in the device, the mechanical 
process is typically identical for the two number systems. 

We disclose the method used for deriving the Y_0j characters of the zero-forcing 
vector in finite fields. 

Compute: J_00 = -N_00^{-1} mod 2^l. 

All natural integers which are relatively prime to a modulus have multiplicative 
inverses in both number fields. N_00 is odd and, therefore, has no factor of 2. All 
factors of 2^l are 2, so that any number with a least significant 1, and a 
modulus whose only factor is 2, are relatively prime, and J_00 always exists. 
Formally, for odd N_00 and 2^l: gcd(N_00, 2^l) = 1. 
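
As a minimal sketch of the computation just stated, assuming the GF(p) case only, the following Python fragment derives J_00 for a given odd right-hand character and tabulates it for every odd character of a given width. The function names are illustrative and are not taken from the disclosure; `pow(x, -1, m)` requires Python 3.8 or later.

```python
def neg_inverse_mod_pow2(n_00, l):
    """Return J_00 = -(N_00 ** -1) mod 2**l for an odd character N_00."""
    if n_00 % 2 == 0:
        raise ValueError("N_00 must be odd to be invertible mod 2**l")
    modulus = 1 << l
    return (-pow(n_00, -1, modulus)) % modulus


def j00_table(l):
    """Tabulate J_00 for all 2**(l-1) odd l-bit characters."""
    return {n: neg_inverse_mod_pow2(n, l) for n in range(1, 1 << l, 2)}


if __name__ == "__main__":
    for width in (1, 2, 4):
        print(width, j00_table(width))
```

Each table entry satisfies J_00·N_00 ≡ -1 mod 2^l, and for l = 1 the only entry is 1.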

This single character of the function can be hardwire implemented with 
random logic, with simple circuitry, or with a simple look up table. There are 
only 2^(l-1) different values that must be derived in a look up table. In the reverse 
mode format, the polynomial modulus must be right justified; nominally odd. In 
typical exponentiational functions, in both number fields, the right hand bit of the 
modulus is a one, nominally odd, and the multiplicative inverse of an odd 
number mod 2^l must always be odd. 

If l = 1, the J_00 multiplier is explicitly equal to 1, and need not be computed. 

The result of a character output forced by the Y_0 function during the first phase 
is always 0; consequently the j'th character output, Z_ij, of the SuperMAP is: 

0 = (R ⊕ S_ij ⊕ A_i0 ⊗ B_0j ⊕ Y_0j ⊗ N_00) mod 2^l = Z_ij; therefore, 
(R ⊕ S_ij ⊕ A_i0 ⊗ B_0j) = -Y_0j ⊗ N_00 mod 2^l; and 
Y_0j = -N_00^{-1} ⊗ (R ⊕ S_ij ⊕ A_i0 ⊗ B_0j) mod 2^l. 

From the above equation, we learn that J_00 is preferably the negative value of 
the modular multiplicative inverse of the right hand k character of the modulus 
for both number systems; noting that in modulo 2 arithmetic, positive and 
negative values are the same. 
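
The zero-forcing relation can be checked numerically for the prime number field. The sketch below is illustrative only: it treats ⊕ and ⊗ as ordinary integer addition and multiplication, reduces every quantity to a single l bit character, and uses hypothetical argument names that mirror the symbols above.

```python
def zero_forcing_character(r, s_ij, a_i0, b_0j, n_00, l):
    """Return (Y_0j, Z_ij) for one clock of the first phase in GF(p).

    Only the right-hand l bits of each quantity matter for this check.
    """
    mask = (1 << l) - 1
    # J_00 is the negative multiplicative inverse of N_00 modulo 2**l.
    j_00 = (-pow(n_00, -1, 1 << l)) & mask
    partial = (r + s_ij + a_i0 * b_0j) & mask
    y_0j = (j_00 * partial) & mask
    # Adding Y_0j copies of N_00 must clear the emitted character.
    z_ij = (partial + y_0j * n_00) & mask
    return y_0j, z_ij


if __name__ == "__main__":
    l = 4
    ok = all(
        zero_forcing_character(r, s, a, b, n, l)[1] == 0
        for n in range(1, 16, 2)
        for r in range(16)
        for s in (0, 5, 15)
        for a in (1, 7)
        for b in (0, 9, 14)
    )
    print(ok)
```

For every odd N_00 the emitted character Z_ij is zero, which is the property the Y_0 function is required to guarantee.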

R is the summation of the value remaining in the CSA summated to the carry 
out bit from the final serial adder, 460, in Fig. 2. S_ij is the j'th bit of the partial 
product at the i'th iteration. A_i0 is the right hand character (LS in GF(p)) of the 
i'th slice of A. B_0j is the j'th character of B. B_0 is a constant (multiplier) during all 
iterations of a Montgomery multiplication. Y_0 is a k character vector generated at 
each (i'th) iteration. Y_0j is the j'th character generated at the j'th clock of the first 
phase of an iteration. N is an m sliced modulus. N_0 is the right hand slice of the 
modulus. N_00 is the right hand character of N_0. 

Formalizing the superscalar modular multiplication method for both fields: 

S_0 = 0; 

For i = 0 to m - 1 (interleave iterations) 

  First phase: (of each interleave) 
  R = 0 

  For j = 0 to k - 1 (each character of first phase) 

    Y_0j = (J_00 ⊗ (R ⊕ S_ij ⊕ A_i0 ⊗ B_0j)) mod 2^l 

    Z_ij = (R ⊕ S_ij ⊕ A_i0 ⊗ B_0j ⊕ Y_0j ⊗ N_00) mod 2^l; and 

    R = [2^l R ⊕ S_ij ⊕ A_i0 ⊗ B_0j ⊕ Y_0j ⊗ N_0] / 2^l 

After k effective clock cycles, the first segment of the Z stream was all zeroes, 
and was disregarded; the relevant Y_0, a k character vector, is now prepared to be a 
multiplicand in the next phase, and the summated R value will be used in the 
next phase. 

  Second phase: 
  For j = k to n - 1 

    Z_ij = (R ⊕ S_ij ⊕ A_i0 ⊗ B_0j ⊕ Y_0j ⊗ N_0j) mod 2^l; 
    R = [2^l R ⊕ S_ij ⊕ A_i0 ⊗ B_0j ⊕ Y_0j ⊗ N_0j] / 2^l 
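
For orientation, the interleaved method can be modelled at segment granularity in the prime number field. The Python sketch below is a simplified behavioural model under stated assumptions, not the character-serial procedure itself: each iteration consumes one whole segment of A, the zero-forcing value is computed for the whole segment at once, and the function and parameter names (`seg_bits` standing for k·l bits) are invented for the example.

```python
def montgomery_interleaved(a, b, n, m, seg_bits):
    """Return a * b * 2**(-m * seg_bits) mod n for odd n < 2**(m * seg_bits).

    a is consumed in m segments, least significant first; b and n are used whole.
    """
    radix = 1 << seg_bits
    # Segment-wide constant J_0: negative inverse of the right-hand segment N_0.
    j0 = (-pow(n % radix, -1, radix)) % radix
    s = 0                                        # S_0 = 0
    for i in range(m):
        a_i = (a >> (i * seg_bits)) & (radix - 1)
        t = s + a_i * b                          # S_i + A_i * B
        y0 = ((t % radix) * j0) % radix          # zero-forcing multiplier
        # The right-hand segment of t + y0 * n is all zeroes and is discarded.
        s = (t + y0 * n) >> seg_bits
    # A single conditional subtraction brings the result below the modulus.
    return s - n if s >= n else s


if __name__ == "__main__":
    n = 0xF123456789ABCDEF
    a = 0x0123456789ABCDEF
    b = 0x0FEDCBA987654321
    p = montgomery_interleaved(a, b, n, m=4, seg_bits=16)
    print((p << 64) % n == (a * b) % n)
```

The result carries the usual Montgomery factor 2^(-n·l), which is why the check multiplies it back by 2^64 before comparing with A·B mod N.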

Implementation of the above algorithm with a character based serial-parallel 
multiplier is a simple extension of the above protocol: 

(Quotient(x, y) is the integer division function without remainder. For 
example, if x = 10101b, and y = 10000b, then Quotient(x, y) = 1.) 

S_0 = 0 

For i = 0 to m - 1 (Interleaved loop) 

  First phase: 
  For j = 0 to k - 1 

    Y_0j = (J_00 ⊗ (S_i0 ⊕ A_i0 ⊗ B_0j ⊕ y·Carry_0 ⊕ Sum_1 
            ⊕ Quotient(S_i0 ⊕ Sum_0, r))) mod r 

    For t = 0 to k - 1 (Whole loop with 1 clock pulse) 

      Sum_t = (Sum_{t+1} ⊕ y·Carry_t ⊕ A_it ⊗ B_0j ⊕ Y_0j ⊗ N_0t) mod r 
      Carry_t = Quotient((Sum_{t+1} ⊕ Carry_t ⊕ A_it ⊗ B_0j ⊕ Y_0j ⊗ N_0t), r) 

  (Output of multiplier in this stage is 0's) 

  Second phase: 

  Main part 

  Carry_Z = 0 
  For j = k to n - 1 

    For t = 0 to k - 1 (Whole loop with 1 clock pulse) 

      Sum_t = (Sum_{t+1} ⊕ y·Carry_t ⊕ A_it ⊗ B_j ⊕ Y_0t ⊗ N_j) mod r 
      Carry_t = Quotient((Sum_{t+1} ⊕ Carry_t ⊕ A_it ⊗ B_j ⊕ Y_0t ⊗ N_j), r) 

    S_{i,j-k} = (S_{i,j-2k} ⊕ Sum_0 ⊕ y·Carry_Z) mod r 
    Carry_Z = Quotient((S_{i,j-2k} ⊕ Sum_0 ⊕ Carry_Z), r) 

  Flushing of the multiplier 
  For j = n to (n + k - 1) 

    For t = 0 to k - 1 (Whole loop with 1 clock pulse) 

      Sum_t = (Sum_{t+1} ⊕ y·Carry_t) mod r 
      Carry_t = Quotient((Sum_{t+1} ⊕ Carry_t), r) 

    S_{i,j-k} = (S_{i,j-2k} ⊕ Sum_0 ⊕ y·Carry_Z) mod r 
    Carry_Z = Quotient((S_{i,j-2k} ⊕ Sum_0 ⊕ Carry_Z), r) 

For a formal explanation with examples of the particular case where l = 1 in the 
GF(p) field, see P1. 

The above describes a microelectronic method and apparatus for performing 
interleaved finite field modular multiplication of integers A and B operative to 
generate an output stream of A times B modulus N, wherein n is the number of 
characters in the modulus operand register and is larger than k, wherein the ⊗ 
multiplication process is performed in iterations, wherein at each interleaved 
iteration with operands input into a ⊗ multiplying device, consisting of N, the 
modulus, B, a multiplier, a previously computed partial result, S, and a k 
character string segment of A, a multiplicand, the segments progressing from the 
A_0 string segment to the A_{m-1} string segment, wherein each iterative result is 
⊕ summated into a next in turn S, temporary result, in turn, wherein first 
emitting characters of iterative results are zeroes, the apparatus comprising: 

Typically, there may be four serial l bit character registers feeding the multiplying 
device, first (B), second (S) and third (N) and preferably (A), configured to efficiently 
load the multiplier. For computations on long operands which typically are not 
accommodated in the MAP's internal registers, the CPU can load operands from its 
accessible memory. 

Typically, these main memory registers store and output operands, respectively 
operative to store a multiplier value, a partial result value and a modulus, N. 

The modular multiplying device is operative to ⊕ summate into the 
⊕ accumulation device, in turn one or two of a plurality of multiplicand values, 
in turn, during the phases of the iterative ⊗ multiplication process, and in turn to 
receive as multipliers, in turn, inputs from a first value B register, second, from 
an "on the fly" anticipating value, Y_0, as a multiplier to force first emitting right- 
hand zero output characters in each iteration, and third values from the modulus, 
N, register. 

The multiplicand parallel registers are operative to receive in turn, values from 
the A, B, and N register sources, and in turn, also a multiplicand zero forcing Y_0 
value. 

The zero forcing Y_0 detect device is operative to generate a binary string 
operative to be a multiplier during the first phase of operation and is operative to 
be a multiplicand in the second phase of each iterative multiplication. 

The multiplicand values to be switched into the ⊕ accumulation device for the 
first phase can be one of four values: a first zero value, a second value, A_i, 
which is a k character string segment of a multiplicand, A, and a third value N_0, 
being the first emitting k characters of the modulus, N. The N_0 value is typically 
switched in at the start of a multiplication, if there is a fourth preload buffer as in 
Fig. 6. Then when a k character slice of A is input, the A_i value is serially 
summated with the N_0 value and stored in the fourth buffer. 

If a computation is typically on a single k character modulus, then there is no 
need for the S register, or for the temporary result value, S. If the operand is 2k 
characters or longer, then the manipulations must be iterative, with progressing 
A_i slices. For squaring operations slices of B are typically snared from the B 
stream on the fly and preloaded into the A_i preload buffer. 

At the first iteration of a multiplication procedure, the temporary result is zero. 
Subsequent temporary results from previous iterations are operative to be 
⊕ summated with the value emanating from the ⊕ accumulation device, to 
generate a partial result for the next in turn iteration. 

The multiplicand values to be input, in turn, into the ⊕ accumulation device for 
the second phase are a first zero value, which is a pseudo register value, a 
second A_i operand, remaining in place from the first phase, and a third Y_0 value 
having been anticipated in the first phase, operative to continue multiplying the 
remaining characters of the N modulus. 

The multiplier values input into the multiplying device in the first phase are 
a first emitting string, B_0, being the first emitting string segment of the B 
operand, concurrently ⊗ multiplying with the second ⊗ multiplier value 
consisting of the anticipated Y_0 string which is simultaneously loaded character 
by character as it is generated into a preload multiplicand buffer for the second 
phase. 

The two multiplier values input into the apparatus during the second phase 
are the left hand n - k character values from the B operand, designated B̲, and 
the left hand n - k characters of the N modulus, designated N̲, respectively. 

The third phase is a flush out of the device operative to transfer the left hand 
segment of a result value remaining in the ⊕ accumulation device. This can 
either be a single clock data dump, or a simple serial unload, driven by zero 
characters fed in the multiplier inputs. 

If the dump is a parallel dump, some means for comparing is required, to decide 
if the result demands an additional reduction by the modulus. 

One of the more innovative enhancements in the invention is the reverse mode 
multiplication in GF(2^q). Because of the lack of interaction between adder 
cells in this arithmetic, it is possible to perform multiplication and reduction 
starting from the MS end of the product, thereby having a product that is the 
modular reduced answer, without a burdensome parasite, caused by disregarded 
zeroes which are tantamount to performing a right shift in conventional 
Montgomery multiplication. 
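
The idea of multiplying and reducing from the MS end can be illustrated with a generic bit-serial polynomial multiplier. The Python sketch below is a textbook MS-first interleaved multiplication over GF(2), offered as an analogy under the assumption of one bit per step; it is not the character-serial reverse mode circuit, and the function name and argument layout are chosen for the example.

```python
def gf2_mul_msb_first(a, b, modulus, q):
    """Return a * b mod modulus over GF(2), scanning b from its MS bit.

    a, b    : polynomials of degree < q, packed as integers
    modulus : polynomial of degree exactly q (bit q set)
    """
    acc = 0
    for bit in range(q - 1, -1, -1):
        acc <<= 1                      # multiply the running product by x
        if (acc >> q) & 1:
            acc ^= modulus             # "add" the modulus; no carries exist
        if (b >> bit) & 1:
            acc ^= a                   # accumulate the multiplicand
    return acc


if __name__ == "__main__":
    # Example in GF(2^8) with the modulus x^8 + x^4 + x^3 + x + 1.
    print(hex(gf2_mul_msb_first(0x57, 0x83, 0x11B, 8)))
```

Because reduction is interleaved from the most significant end, the value left in the accumulator is the reduced product itself, with no residual power-of-two factor to remove afterwards.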

The second innovation that allows for automatic zero forcing is an extension of 
the Y_0 function of patent application P2, which describes a device wherein only 
one bit was anticipated at a time. There the J_00 bit, only, multiplied the single bit 
XORed values. Both the multiplicative inverse of an odd number and its 
negative value produce an odd number. This saved implementing a look up table 
or a random logic circuit to compute the J_0 value for l = 1. Note, J_0 is a different 
quantity in non-alike number systems. We have shown in this extension how a 
Y_0 value can be derived, for both relevant number fields. 

The following describes the elements of the circuitry operative to anticipate the 
Y_0 value using first emitting values of the multiplicand, and present inputs of the 
B multiplier, carry out values from the ⊕ accumulation device, ⊕ summation 
values from the ⊕ accumulation device, the present values from the previously 
computed partial result, and carry out values from the ⊕ adder which 
⊕ summates the result from the ⊕ accumulation device with the previous partial 
result. 

Stated differently, the six values operative to control the zero forcing function 
are: 

i the l bit S_out bits of the result of the l bit by l bit mod 2^l ⊗ multiplication 
of the right-hand character of the A_i register times the B_d character of the B 
stream, A_0 ⊗ B_d mod 2^l; 

ii the first emitting carry out character from the ⊕ accumulation device, 
y(CO_0); 

iii the l bit S_out character from the second from the right-hand character 
emitting cell of the ⊕ accumulation device, SO_1; 

iv the next in turn character value from the S stream, S_d; 

v the l bit carry out character from the Z output full adder, y(CO_Z); 

vi the l bit J_0 value, which is the negative multiplicative inverse of the 
right-hand character in the N_0 modulus multiplicand register; 

wherein values A_0 ⊗ B_d mod 2^l, y(CO_0), SO_1, S_d are ⊕ added character to 
character together and "on the fly" ⊗ multiplied by the J_0 character to output a 
valid Y_0 zero-forcing anticipatory character to force an l bit egressing character 
string of zeroes. 

Just as in P1, in order to determine if an output must be modular reduced, a 
sensor is operative to compare the output result to N, the modulus, the mechanism 
being operative to actuate a second subtractor on the output of the result register, 
thereby to output a modular reduced value which is limited congruent to the 
output result value, precluding the necessity to allot a second memory storage for 
a smaller result. 

The single ⊕ accumulation device, configured to perform multiplication, and 
an anticipating zero forcing mechanism together are operative to perform a 
series of interleaved ⊗ modular multiplications and squarings. The total device 
performs the equivalent of three integer multiplications, as in a conventional 
Montgomery method: J_0 is a k character device multiplying the first k character 
summation of B_0 ⊗ A_i and S_i, and, finally, the Y_0 is used to multiply N. 

Whilst the SuperMAP is computing the last iteration of a multiplication, the 
first slice of a next multiplication can be preloaded into a preload register buffer 
means on the fly. This value may be the result of a previous multiplication or a 
slice of a multiplicand residing in one of the register segments in the register 
bank of Fig. 1 or Fig. 5. 

The preloaded value which is a ⊕ summation of two multiplicands is 
⊕ summated into a k character register, only, for GF(2^q) computations. In GF(p) 
computations, provision must be made for an additional carry bit. 

Especially for very long moduli, buffers and registers adjacent to the 
SuperMAP typically have insufficient memory resources. Means for loading 
operands directly into preload buffers is provided, operative to store operands in 
the CPU's memory map. For reverse format multiplication, bit order of input 
words from the CPU are typically reversed in the Data In and Data Out devices. 
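
As a small illustration of the bit order reversal mentioned above, the following Python helper reverses the bits of a fixed-width word. It is only a software analogy to what the Data In and Data Out devices are described as doing; the name and the explicit width argument are assumptions of the example.

```python
def reverse_bits(word, width):
    """Return word with its width-bit binary representation reversed."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (word & 1)   # move the current LS bit to the MS side
        word >>= 1
    return result


if __name__ == "__main__":
    print(format(reverse_bits(0b00010110, 8), "08b"))
```

Applying the function twice with the same width returns the original word, so the same operation serves for both loading and unloading.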

BRIEF DESCRIPTION OF THE DRAWINGS 

In the drawings: 

Thick lines designate k character (kl bit) wide parallel bus lines. Thinner 
contiguous signal lines depict l bit wide lines. Most control lines are not 
depicted; those that are included are typically necessary to understand 
procedures and are typically drawn as dash-dot-dash lines. 

Fig. 1 is a block diagram of the apparatus according to an embodiment of the 
invention where four main registers are depicted and the serial data flow path to 
the operational unit is shown and the input and output data path to the host CPU 
of Fig. 3; 

Fig. 2 is a block diagram of an embodiment of an operational unit operative to 
manipulate data from Fig. 1; 

Fig. 3 is a simplified block diagram of a preferred embodiment of a complete 
single chip, monolithic cryptocomputer, typically in smart cards; 

Fig. 4 is a simplified block diagram of a preferred embodiment of a complete 
single chip monolithic cryptocomputer wherein a data disable switch is operative 
to provide for accelerated unloading of data from the operational unit; 

Fig. 5 is a simplified block diagram of a data register bank, operative to 
generate J_0; 

Fig. 6 is a simplified block diagram of an operational unit, wherein the Y_0 sense 
is a device operative to force a zero first phase output; 

Fig. 7A is a block diagram of the main computational part of Fig. 6, with 
circled numbered sequence icons relating to the timing diagrams and flow charts 
of Figs. 7B, 7C, and Fig. 7D; 

Fig. 7B is an event timing pointer diagram showing progressively the process 
leading to and including the first iteration of a squaring operation; 

Fig. 7C is a detailed event sequence to eliminate the "Next Montgomery 
Squaring" delays in the first iteration of a squaring sequence, with iconed pointers 
relating to Fig. 7A, Fig. 7B, and Fig. 7D; and 

Fig. 7D illustrates the timing of the computational output, relating to Fig. 7A, 
7B, and Fig. 7C. 

Figs. 8A and 8B, taken together, describe generation of the Y_0 vector in GF(2^q) 
and in GF(p). 

Fig. 8A is a set of look up tables for determining the negative multiplicative 
inverse of the right hand character of N_0, for l = 2 and l = 4. 

Fig. 8B describes, in simplified block form, the signals that generate the Y_0 
function for l = 2 and l = 4 in both number fields. 

DESCRIPTION OF PREFERRED EMBODIMENTS 
In the drawings: 

Thick lines designate k character (kl bit) wide parallel bus lines. Thinner 
contiguous connecting signal lines depict l bit wide lines. Typically, control 
lines are not depicted; those that are preferably necessary to understand 
procedures are typically drawn as dash-dot-dash lines. 

Figs. 1 - 2, taken together, form a simplified block diagram of a serial-parallel 
arithmetic logic unit (ALU) constructed and operative in accordance with a 
preferred embodiment of the present invention. The apparatus of Figs. 1 - 2 
preferably includes the following components: 

Single Multiplexers - Controlled Switching Elements which select one signal 
or character stream from a multiplicity of inputs of signals and direct this 
chosen signal to a single output. Multiplexers are marked M1 to M13, and are 
intrinsic parts of larger elements. 

The Multiplexer and pre-adder, 390, is an array of k·l + 1 multiplexers, and 
chooses which of the four k or k + 1 character inputs are to be added into the 
CSA, 410. 

The B (70 and 80), S_A (130), S_B (180), and N (200 and 210) registers are the four 
main serial registers in a preferred embodiment. The S_A is conceptually 
and practically redundant, but can considerably accelerate very long number 
computations, and save volatile memory resources, especially in the case where 
the length of the modulus is 2·k·m characters long. 

Serial Adders and Serial Subtractors are logic elements that have two serial 
character inputs and one serial character output, and summate or perform 
subtraction on two long strings of characters. Components 90 and 500 are 
subtractors; 330 and 460 are serial adders. The propagation time from input to 
output is very small. Serial subtractors 90 and 500 typically reduce B* to B if B* 
is larger than or equal to N and/or S* to S if S* is larger than or equal to N. 
Serial Subtractor 480 is used, as part of a comparator component, to detect if B* 
will be larger than or equal to N. Full Adder 330 adds the two character streams 
which feed the Load Buffer 340 with a value that is equal to the sum of the 
values in the 290 and 320 Load Buffers. 

Fast Loaders and Unloaders, 10 and 20, and 30 and 40, respectively, are
devices to accelerate the data flow from the CPU controller. Typically, these
devices eliminate the necessity for other direct memory access components. 20
and 40 are for reversing the data word, as is necessary for reverse format
GF(2^q) multiplications.

Data In, 50, is a parallel in serial out device, as the present ALU device is a 
serial fed systolic processor, and data is fed in, in parallel, and processed in 
serial. 

Data Out, 60, is a serial in parallel out device, for outputting results from the
coprocessor.

The quotient generator is that part of Fig. 2 which generates a
quotient character at each iteration of the dividing mechanism.

Flush Signals on Bd, 240; on S*d, 250; and on Nd, 260, are made to assure that
the last k + 1 characters can flush out the CSA. A second embodiment would
reconcile the R data at the end of the second phase, and would perform a single
parallel data dump to flush out the CSA.

Load Buffers R1, 290; R2, 320; and R3, 340 are serial in parallel out shift
registers adapted to receive the three possible more than zero multiplicand
combinations.

Latches L1, 360; L2, 370; and L3, 380 are made to receive the outputs from
the load buffers, thereby allowing the load buffers the time to
process the next phase of data before this data is latched into L1, L2,
and L3. Latch L0 is typically a "virtual" constant all zero input into 390, which
typically is not implemented in latched logic.

Y_0 Sense, 430, is the logic device which determines the number of times the
modulus is accumulated, in order that a k character string of LS zeros will exit at
Z in ⊗ multiplications.

One character delay devices 100, 220 and 230 are inserted in the respective
data streams to accommodate for computation synchronization between the data
preparation devices in Fig. 1 and the data processing devices in Fig. 2.

The k character delay shift register, 470, synchronizes N with the result, after
the zero output strings are disregarded, for the larger than N comparison.

The Carry Save Accumulator is almost identical to a serial/parallel multiplier,
excepting for the fact that three different larger than zero values can be
summated, instead of the single value as conventionally is latched onto the input
of the s/p multiplier. When used in polynomial based computations, all "carry
dependent" functions are disabled.

The Insert Last Carry, 440, is used to insert the (m·k·l + 1)'th bit of the S stream,
as the S register is only m·k characters long.

The borrow/overflow detect, 490, typically detects if a result is larger than or
equal to the modulus (from N), in GF(p) computations. In polynomial based
computations the overflow is detected if the first significant result bit is a one.

The control mechanism is not depicted, but is preferably understood to be a set
of cascaded counting devices with finite state machines for specific functions
with switches set for systolic data flow in both GF(p) and GF(2^q).

For modular multiplication in the prime and composite prime field of numbers,
we define A and B to be the multiplicand and the multiplier, and N to be the
modulus, which is typically larger than A or B. N also denotes the register where
the value of the modulus is stored. N may, in some instances, be smaller than A.
We define A, B, and N as m·k = n character long operands. Each k character
group will be called a segment, the size of the group defined by the size of the
multiplying device. Then A, B, and N are each m segments long. For ease in
following the step by step procedural explanations, assume that A, B, and N are
512 bits long (n = 512); assume that k is 64 characters long because of the
present cost effective length of such a multiplier, and data manipulation speeds
of simple CPUs; and m = 8 is the number of segments in an operand and also the
number of iterations in a squaring or multiplying loop with a 512 bit operand.
All operands are positive integers. More generally, A, B, N, n, k and m may
assume any suitable values.

In non-modular functions, the N and S registers can be used for temporary
storage of other arithmetic operands.

We use the symbol, ≡, to denote congruence of modular numbers, for example
16 ≡ 2 mod 7, and we say 16 is congruent to 2 modulo 7 as 2 is the remainder
when 16 is divided by 7. When we write Y mod N ≡ X mod N, both Y and X may
be larger than N; however, for positive X and Y, the remainders will be identical.
Note also that the congruence of a negative integer Y is Y + u·N, where N is the
modulus, and if the congruence of Y is to be less than N, u will be the smallest
integer which will give a positive result.

We use the symbol, ¥, to denote congruence in a more limited sense. During
the processes described herein, a value is often either the desired value, or equal
to the desired value plus the modulus. For example X ¥ 2 mod 7. X can be equal
to 2 or 9. We say X has limited congruence to 2 mod 7. In the polynomial based
field, the analog is a monic value, which we say is larger than N, and is reduced
by XORing to the modulus. As in GF(2^q) there is no overflow, this Yen value is
typically disregarded.

When we write X = A mod N, we define X as the remainder of A divided by N;
e.g., 3 = 45 mod 7.

In number theory the modular multiplicative inverse is a basic concept. For
example, the modular multiplicative inverse of X is written as X^{-1}, which is
defined by X·X^{-1} mod N = 1. If X = 3, and N = 13, then X^{-1} = 9, i.e., the
remainder of 3·9 divided by 13 is 1.
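
The inverse in the example above can be checked with a short sketch. The function name modular_inverse is an illustrative assumption and is not part of the described apparatus; the extended Euclidean method used here is only one common way to obtain the inverse.

```python
def modular_inverse(x: int, n: int) -> int:
    """Return X^-1 such that (X * X^-1) mod N == 1, by extended Euclid."""
    old_r, r = x % n, n
    old_s, s = 1, 0                 # invariant: old_s * x == old_r (mod n)
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_s, s = s, old_s - q * s
    if old_r != 1:
        raise ValueError("x has no inverse modulo n")
    return old_s % n


inverse = modular_inverse(3, 13)
print(inverse)               # 9
print((3 * inverse) % 13)    # 1
```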

The acronyms MS and LS are used to signify most significant and least 
significant when referencing bits, characters, segments, and full operand values, 
as is conventional in digital nomenclature. 

Throughout this specification N designates both the value N, and the name of
the shift register which contains N. An asterisk superscript on a value denotes
that the value, as it stands, is potentially incomplete or subject to change. A is the
value of the number which is to be exponentiated, and n is the character length
of the N operand. After initialization, when A is "Montgomery normalized" to A*
(A* = 2^n·A, to be explained later), A* and N are constant values throughout the
intermediate steps in the exponentiation. During the first iteration, after
initialization of an exponentiation, B is equal to A*. B is also the name of the
register wherein the accumulated value, which finally equals the desired result of
exponentiation, resides. S* designates a temporary value, and S, S_A and S_B
designate, also, the register or registers in which all but the single MS bit of S is
stored. (S* concatenated with this MS bit is identical to S.) S(i-1) denotes the
value of S at the outset of the i'th iteration; S_0 denotes the LS segment of an
S(i)'th value.

We refer to the process in the GF(p) field (defined later), P(A·B)N, as
multiplication in the P field, or sometimes, simply, a multiplication operation.

As we have used the standard structure of a serial/parallel multiplier as the
basis for constructing a double acting serial parallel multiplier, we differentiate
between the summating part of the multiplier, which is based on carry save
accumulation (as opposed to a carry look ahead adder, or a ripple adder, the first
of which is considerably more complicated and the second very slow), and call it
a carry save adder or accumulator, and deal separately with the preloading
mechanism and the multiplexer and latches, which allow us to simultaneously
multiply A times B and C times D, while continuously summating both results with
a previous result, S, e.g., A·B + C·D + S, converting this accumulator into a more
versatile engine. Additional logic is added to this multiplier in order to provide
for an anticipated sense operation necessary for modular reduction and serial
summation necessary to provide for modular arithmetic and ordinary integer
arithmetic on very large numbers.

Montgomery Modular Multiplication in GF(p) 

The following description refers to Montgomery arithmetic in the GF(p) of
numbers. The present device may be used for Montgomery arithmetic on
polynomial based numbers in GF(2^q), but would be degraded in performance, as
computations would be in the P field, where all executable operands are
multiplied by a factor of 2^n.

In a classic approach for computing a modular multiplication, A·B mod N, the
remainder of the product A·B is computed by a division process. Implementing a
conventional division of large operands is more difficult to perform than
serial/parallel multiplications.

Using Montgomery's modular reduction method, division is essentially
replaced by multiplications using two precomputed constants. In the procedure
demonstrated herein, there is only one precomputed constant, which is a
function of the modulus. This constant is, or can be, computed using this ALU
device.

A simplified presentation of the Montgomery process, as is used in this device 
is now provided, followed by a complete preferred description. 

If we have an odd number (an LS bit one), e.g., 1010001 (= 81_10), we can
always transform this odd number to an even number (a single LS bit of zero) by
adding to it another fixing number, e.g., 1111 (= 15_10), as
1111 + 1010001 = 1100000 (= 96_10). In this particular case, we have found a
number that produced five LS zeros, because we knew in advance the whole
string, 81, and could easily determine a binary number which we could add to
81, and would produce a new binary number that would have as many LS zeros
as we might need. This fixing number must have a right hand one, else it has no
effect on the progressive LS characters of a result.
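
As a minimal sketch of this idea, assuming the whole string is known in advance, the fixing number for a required count of LS zeros is simply the negative of the value modulo the corresponding power of two. The name fixing_number is hypothetical and used only for illustration.

```python
def fixing_number(value: int, zeros: int) -> int:
    """Smallest non-negative number that, added to `value`, clears the
    `zeros` least significant bits of the sum."""
    return (-value) % (1 << zeros)


value = 0b1010001                  # 81
fix = fixing_number(value, 5)      # five LS zeros are requested
total = value + fix
print(bin(fix), bin(total))        # 0b1111 0b1100000
```

For an odd value the fixing number is itself odd, which agrees with the requirement that it have a right hand one.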

If our process is a clocked serial/parallel carry save process, where it is desired
to have a continuous number of LS zeros, and wherein at each clock cycle we
only have to fix the next bit, at each clock it is sufficient to add the fix if the
next bit were to be a one, or not to add the fix if the anticipated bit were to be a
zero. However, in order not to cause interbit overflows (double carries), this fix
is preferably summated previously with the multiplicand, to be added into the
accumulator when the relevant multiplier bit is one, and the Y_0 Sense also
anticipates a one.

Now, as in modular arithmetic we are only interested in the remainder of a
value divided by the modulus, we know that we can add the modulus any
number of times to a value, and still have a value that would have the same
remainder. This means that we can add Y·N = Σ y_i·r^i·N to any integer, and still
have the same remainder; Y being the number of times we add in the modulus,
N, to produce the required k·l right hand zeros. As described, the modulus that we
add can only be odd. (Methods exist wherein even moduli are defined as 2^i times
the odd number that results when i is the number of LS zeros in the even
number.)

Montgomery interleaved reductions typically reduce storage requirements, and
the cost effective size of the multiplication devices. This is especially useful
when performing public key cryptographic functions where we multiply one
large integer, e.g., n = 1024 bit, by another same length large integer; a process
that would ordinarily produce a double length integer.

We can add in Ns (the modulus) enough times to A·B = X or A·B + S = X during
the process of multiplication (or squaring) so that we will have a number, Z, that
has n LS zeros, and, at most, n + 1 MS characters.

We can continue using such numbers, disregarding the LS n characters, if we
remember that by disregarding these zeros, we have divided the desired result
by r^n.

When the LS n characters are disregarded, and we only use the most significant
n (or n + 1) characters, then we have effectively multiplied the result by r^{-n}, the
modular inverse of r^n. If we would subsequently re-multiply this result by r^n
mod N (or r^n) we would obtain a value congruent to the desired result (having
the same remainder) as A·B + S mod N. As is seen, using MM, the result is
preferably multiplied by r^{2n} to overcome the r^{-n} parasitic factor reintroduced by
the MM.

Example:

A·B + S mod N = (12·11 + 10) mod 13 = (1100·1011 + 1010)_2 mod 1101_2.
l = 1, r = 2

We will add in 2^i·N whenever a fix is necessary on one of the n LS bits.

      B                       1011
    × A                       1100
    add S                     1010
    add A(0)·B                0000
                                      sum of LS bit = 0, do not add N
    add 2^0·(N·0)             0000
    sum and shift adder       0101    -> 0 LS bit leaves carry save adder
    add A(1)·B                0000
                                      sum of LS bit = 1, add N
    add 2^1·(N·1)             1101
    sum and shift adder       1001    -> 0 LS bit leaves CS adder
    add A(2)·B                1011
                                      sum of LS bit = 0, do not add N
    add 2^2·(N·0)             0000
    sum and shift adder       1010    -> 0 LS bit leaves CS adder
    add A(3)·B                1011
                                      sum of LS bit = 1, add N
    add 2^3·(N·1)             1101
    sum and shift adder      10001    -> 0 LS bit leaves CS adder

And the result is 10001 0000_2 mod 13 = 17·2^4 mod 13.

As 17 is larger than 13 we subtract 13, and the result is:
17·2^4 ≡ 4·2^4 mod 13.

Formally, 2^{-n}·(A·B + S) mod N = 9·(12·11 + 10) mod 13 = 4.

In Montgomery arithmetic we utilize only the MS non-zero result,
4, and effectively remember that the real result has been divided by 2^n; n zeros
having been forced onto the MM result.

We have added in (8 + 2)·13 = 10·13, which effectively multiplied
the result by 2^4 mod 13 ≡ 3. In effect, had we used the superfluous zeros, we can
say that we have performed A·B + Y·N + S = (12·11 + 10·13 + 10) in one process,
which will be described as possible on a preferred embodiment.

Check: (12·11 + 10) mod 13 = 12; 4·3 = 12.
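
The bit by bit procedure of this example can be modelled with the following sketch, which assumes l = 1 (r = 2), an odd modulus, B smaller than N and S smaller than 2N. The function name serial_montgomery is hypothetical, and an ordinary integer stands in for the carry save adder; the sketch does not model the hardware itself.

```python
def serial_montgomery(a: int, b: int, s: int, n_mod: int, n_bits: int):
    """Bit-serial model of (A*B + S) * 2^-n mod N for odd N and l = 1.

    Returns (result, y), where y records at which bit positions N was added.
    """
    if n_mod % 2 == 0:
        raise ValueError("the modulus must be odd")
    acc = s                      # accumulator, preloaded with S
    y = 0
    for i in range(n_bits):
        if (a >> i) & 1:         # multiplier bit A(i) is one: add B
            acc += b
        if acc & 1:              # the next LS bit would be a one: add N
            acc += n_mod
            y |= 1 << i
        acc >>= 1                # a zero LS bit leaves the accumulator
    if acc >= n_mod:             # one subtraction suffices for B < N, S < 2N
        acc -= n_mod
    return acc, y


result, y = serial_montgomery(12, 11, 10, 13, 4)
print(result, y)                 # 4 10
```

The second returned value, 10, is the (8 + 2) multiple of the modulus noted in the example.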

In summary, the result of a Montgomery Multiplication is the
desired result multiplied by 2^{-n}.

To retrieve the previous result back into a desired result using the
same multiplication method, we would have to Montgomery Multiply the
previous result by 2^{2n}, which we will call H, as each MM leaves us with a
parasitic factor of 2^{-n}.

The Montgomery Multiply function P(A·B)N performs a
multiplication modulo N of the A·B product into the P field. (In the above
example, where we derived 4.) The retrieval from the P field back into the
normal modular field is performed by enacting P on the result of P(A·B)N using
the precomputed constant H. Now, if P ≡ P(A·B)N, it follows that
P(P·H)N ≡ A·B mod N; thereby performing a normal modular multiplication in
two P field multiplications.

Montgomery modular reduction averts a series of multiplication and
division operations on operands that are n and 2n characters long, by performing
a series of multiplications, additions, and subtractions on operands that are n or
n + 1 characters long. The entire process yields a result which is smaller than or
equal to N. For given A, B and odd N there is always a Q, such that A·B + Q·N
will result in a number whose n LS characters are zero, or:

P·r^n = A·B + Q·N

This means that we have an expression that is 2n characters long
(with a possible one bit overflow), whose n LS characters are zero.

Now, for radix r = 2^l, let I·r^n ≡ 1 mod N (I exists for all odd N).

Multiplying both sides of the previous equation by I yields the following
congruences:

from the left side of the equation:

P·I·r^n ≡ P mod N; (Remember that I·r^n ≡ 1 mod N)

and from the right side:

A·B·I + Q·N·I ≡ A·B·I mod N; (Remember that Q·N·I ≡ 0 mod N)

therefore:

P ≡ A·B·I mod N.

This also means that a parasitic factor I ≡ r^{-n} mod N is introduced each time a
P field multiplication is performed.

We define the P operator such that:

P ≡ A·B·I mod N ≡ P(A·B)N,

and we call this "multiplication of A times B in the P field", or Montgomery
Multiplication.

The retrieval from the P field can be computed by operating P on P·H, making:

P(P·H)N ≡ A·B mod N;

We can derive the value of H by substituting P in the previous congruence.
We find:

P(P·H)N ≡ (A·B·I)(H)(I) mod N;

(see that A·B·I ← P; H ← H; I ← and any multiplication operation introduces
a parasitic I)

If H is congruent to the multiplicative inverse of I^2 then the congruence
is valid, therefore:

H = I^{-2} mod N ≡ r^{2n} mod N

(H is a function of N and we call it the H parameter)

In conventional Montgomery methods, to enact the P operator on
A·B, the following process may be employed, using the precomputed constant J:

1) X = A·B

2) Y = (X·J) mod r^n (only the n LS characters are necessary)

3) Z = X + Y·N

4) S = Z / r^n (The requirement on J is that it forces Z to be
divisible by r^n)

5) P ¥ S mod N (N is to be subtracted from S, if S ≥ N)

Finally, at step 5):

P ¥ P(A·B)N,

[After the subtraction of N, if necessary:

P = P(A·B)N]

Following the above:

Y = A·B·J mod r^n (using only the n LS characters);

and:

Z = A·B + (A·B·J mod r^n)·N.

In order that Z be divisible by r^n (the n LS characters of Z are
preferably zero), the following congruence will exist:

[A·B + (A·B·J mod r^n)·N] mod r^n ≡ 0

In order that this congruence will exist, N·J mod r^n is congruent to
-1, or:

J ≡ -N^{-1} mod r^n,

and we have found the constant J.

J, therefore, is a precomputed constant which is a function of N only.
However, in a machine that outputs a MM result, character by character,
provision should be made to add in Ns at each instance where the output
character in the LS string would otherwise have been non-zero, thereby obviating
the necessity of precomputing J and subsequently computing Y = A·B·J mod r^n,
as Y can be detected character by character using hardwired logic. We have also
described that this method can only work for odd Ns.

Therefore, as is apparent, the process described employs three
multiplications, one summation, and a maximum of one subtraction, for the
given A, B, N, and a precomputed constant to obtain P(A·B)N. Using this result,
the same process and a precomputed constant, H (a function of the modulus N),
we are able to find A·B mod N. As A can also be equal to B, this basic operator
can be used as a device to square or multiply in the modular arithmetic.
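
Steps 1) to 5) above, together with the constants J and H, can be sketched as follows. This is an illustrative model under the assumptions that A and B are smaller than N and that N is odd and smaller than r^n; the function names are hypothetical, and the built-in three-argument pow with a negative exponent assumes Python 3.8 or later.

```python
def montgomery_constants(n_mod: int, n_chars: int, l: int = 1):
    """Return (r^n, J, H) with J = -N^-1 mod r^n and H = r^(2n) mod N."""
    big_r = 1 << (l * n_chars)                   # r^n for radix r = 2^l
    j = (-pow(n_mod, -1, big_r)) % big_r
    h = (big_r * big_r) % n_mod
    return big_r, j, h


def montgomery_product(a: int, b: int, n_mod: int, big_r: int, j: int) -> int:
    """P(A*B)N = A*B*r^-n mod N, following steps 1) to 5)."""
    x = a * b                                    # step 1
    y = (x * j) % big_r                          # step 2: n LS characters only
    z = x + y * n_mod                            # step 3: n LS characters of Z are zero
    s = z // big_r                               # step 4
    return s - n_mod if s >= n_mod else s        # step 5


big_r, j, h = montgomery_constants(13, 4)        # N = 13, n = 4, r = 2
p = montgomery_product(12, 11, 13, big_r, j)     # in the P field
print(big_r, j, h, p)                            # 16 11 9 5
print(montgomery_product(p, h, 13, big_r, j))    # 2, i.e. 12*11 mod 13
```

The second call shows the retrieval from the P field by a further P field multiplication with H.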

Interleaved Montgomery Modular Multiplication

The previous section describes a method for modular multiplication which
involved multiplications of operands which were all n characters long, and
results which required 2n + 1 characters of storage space.

Using Montgomery's interleaved reduction as described in P1, it is
possible to perform the multiplication operations with shorter operands,
registers, and hardware multipliers; enabling the implementation of an electronic
device with relatively few logic gates.

First we will describe how the device can work, if at each iteration of the
interleave, we compute the number of times that N is added, using the J_0
constant. Later, we describe how to interleave, using a hardwire derivation of Y_0,
which will eliminate the J_0 phase of each multiplication {(2) in the following
example}, and enable us to integrate the functions of two separate
serial/multipliers into the new single generic multiplier which can perform
A·B + C·N + S at better than double speed using similar silicon resources.

Using a k character multiplier, it is convenient to define segments of k
character length; there are m segments in n characters; i.e., m·k = n.
J_0 will be the LS segment of J.
Therefore:

J_0 ≡ -N_0^{-1} mod r^k (J_0 exists as N is odd).

Note, the J and J_0 constants are compensating numbers that, when enacted
on the unreduced output, tell us how many times to add the modulus, in order to
have a predefined number of least significant zeros. We will later describe an
additional advantage to the present serial device; since the next serial bit of
output can be easily determined, we can always add the modulus (always odd) to
the next intermediate result. This is the case if, without this addition, the output
character, the LS serial bit exiting the CSA, would have been a "1"; thereby
adding in the modulus to the previous even intermediate result, and thereby
promising another LS zero in the output string. Remember, congruency is
maintained, as no matter how many times the modulus is added to the result the
remainder is constant.
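
The equivalence between adding the modulus whenever the next output bit would be a one and multiplying by the J_0 constant can be illustrated with the sketch below, assuming l = 1. Both function names are hypothetical; the first models the serial anticipation in ordinary integer arithmetic, and the second uses the precomputed constant (with Python 3.8 or later for the modular inverse via pow).

```python
def anticipate_y0(x: int, n_mod: int, k: int):
    """Decide one bit of Y_0 per clock; return (Y_0, (X + Y_0*N) / 2^k)."""
    y0 = 0
    z = x
    for i in range(k):
        if z & 1:                # the exiting bit would be a one: add N
            y0 |= 1 << i
            z += n_mod
        z >>= 1                  # a zero bit exits
    return y0, z


def y0_from_j0(x: int, n_mod: int, k: int) -> int:
    """Y_0 = X_0 * J_0 mod 2^k, with J_0 = -N_0^-1 mod 2^k."""
    r_k = 1 << k
    j0 = (-pow(n_mod % r_k, -1, r_k)) % r_k
    return ((x % r_k) * j0) % r_k


print(anticipate_y0(0x3F61, 0xA59, 4))   # (7, 2173), and 2173 is 0x87d
print(y0_from_j0(0x3F61, 0xA59, 4))      # 7
```

The operands are those of the first step of the hexadecimal example given later in this section.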

In the conventional use of Montgomery's interleaved reduction, P(A·B)N
is enacted in m iterations as described in steps (1) to (5):

Initially S(0) = 0 (the ¥ value of S at the outset of the first iteration).
For i = 1, 2, ..., m:

1) X = S(i-1) + A_{i-1}·B (A_{i-1} is the (i-1)'th k character segment of A;
S(i-1) is the value of S at the outset of the i'th iteration.)

2) Y_0 = X_0·J_0 mod r^k (The LS k characters of the product of X_0·J_0)
(The process uses and computes the k LS characters only,
e.g., the least significant 64 characters.) In the preferred implementation, this
step is obviated, because in a serial machine Y_0 can be anticipated character by
character.

3) Z = X + Y_0·N

4) S(i) = Z/r^k (The k LS characters of Z are always 0, therefore Z is
always divisible by r^k. This division is tantamount to a k character right
shift as the LS k characters of Z are all zeros; or as will be seen in the circuit, the
LS k characters of Z are simply disregarded.)

5) S(i) = S(i) mod N (N is to be subtracted from those S(i)'s which
are larger than N). Finally, at the last iteration (after the subtraction of N, when
necessary), C = S(m) = P(A·B)N.

To derive F = A·B mod N, the P field computation, P(C·H)N, is performed.

It is desired to know, in a preferred embodiment, that for all S(i)'s, S(i) is
smaller than 2N. This also means that the last result (S(m)) can always be
reduced to a quantity less than N with, at most, one subtraction of N.

We observe that for operands which are used in the process:

S(i-1) < r^{n+1} (the temporary register can be one bit longer than the B or N
register),

B < N < r^n and A_{i-1} < r^k.

By definition:

S(i) = Z/r^k (The value of S at the end of the process, before a
possible subtraction.)

For all Z, Z(i) < r^{k+n+1}.

X_max = S_max + A_i·B < r^{n+1} - 1 + (r^k - 1)(N - 1)
Q_max = Y_0·N < (r^k - 1)(r^n - 1)

therefore

Z_max < r^{k+n+1} - 1,

and as Z_max is divided by r^k:

S(m) < r^{n+1} - r^1.

Because N_min > r^n - r, S(m)_max is always less than 2·N_min, and therefore, one
subtraction is all that is necessary on a final result:

S(m)_max - N_min = (r^{n+1} - r^1 - 1) - (r^n - 1) = r^{n+1} - r^n - r^1,

which for r = 2 is 2^n - 2 < N_min.

Example of a Montgomery interleaved modular multiplication:

The following computations in the hexadecimal format clarify the
meaning of the interleaved method:

N = a59 (the modulus), A = 99b (the multiplier), B = 5c3 (the multiplicand),
n = 12 (the character length of N), r = 2, k = 4 (the size in characters of the
multiplier and also the size of a segment), and m = 3, as n = k·m.

J_0 = 7, as 7·9 ≡ -1 mod 16, and H ≡ 2^{2·12} mod a59 ≡ 44b.

The expected result is F ≡ A·B mod N ≡ 99b·5c3 mod a59 ≡ 375811
mod a59 = 220_16.

Initially: S(0) = 0

Step 1   X = S(0) + A_0·B = 0 + b·5c3 = 3f61
         Y_0 = X_0·J_0 mod r^k = 7 (Y_0 - hardwire anticipated in SuperMAP)
         Z = X + Y_0·N = 3f61 + 7·a59 = 87d0
         S(1) = Z/r^k = 87d

Step 2   X = S(1) + A_1·B = 87d + 9·5c3 = 3c58
         Y_0 = X_0·J_0 mod r^k = 8·7 mod 2^4 = 8 (Hardwire anticipated)
         Z = X + Y_0·N = 3c58 + 52c8 = 8f20
         S(2) = Z/r^k = 8f2

Step 3   X = S(2) + A_2·B = 8f2 + 9·5c3 = 3ccd
         Y_0 = d·7 mod 2^4 = b (Hardwire anticipated)
         Z = X + Y_0·N = 3ccd + b·a59 = aea0
         S(3) = Z/r^k = aea,

as S(3) > N,

         S(m) = S(3) - N = aea - a59 = 91

Therefore C = P(A·B)N = 91_16.

Retrieval from the P field is performed by computing P(C·H)N:

Again initially: S(0) = 0

Step 1   X = S(0) + C_0·H = 0 + 1·44b = 44b
         Y_0 = d (Hardwire anticipated in SuperMAP)
         Z = X + Y_0·N = 44b + 8685 = 8ad0
         S(1) = Z/r^k = 8ad

Step 2   X = S(1) + C_1·H = 8ad + 9·44b = 2f50
         Y_0 = 0 (Hardwire anticipated in SuperMAP)
         Z = X + Y_0·N = 2f50 + 0 = 2f50
         S(2) = Z/r^k = 2f5

Step 3   X = S(2) + C_2·H = 2f5 + 0·44b = 2f5
         Y_0 = 3 (Hardwire anticipated in SuperMAP)
         Z = X + Y_0·N = 2f5 + 3·a59 = 2200
         S(3) = Z/r^k = 220_16

which is the expected value of 99b·5c3 mod a59.
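
The hexadecimal example can be reproduced with the following sketch of the interleaved steps (1) to (5). The function name interleaved_montgomery is hypothetical; Y_0 is computed here from J_0 rather than anticipated in hardwired logic, and the modular inverse via pow assumes Python 3.8 or later.

```python
def interleaved_montgomery(a: int, b: int, n_mod: int, k: int, m: int, l: int = 1) -> int:
    """P(A*B)N computed in m iterations over k character segments of A."""
    r_k = 1 << (l * k)                           # r^k for radix r = 2^l
    j0 = (-pow(n_mod % r_k, -1, r_k)) % r_k      # J_0 = -N_0^-1 mod r^k
    s = 0                                        # S(0) = 0
    for i in range(m):
        a_seg = (a >> (i * l * k)) % r_k         # segment A_i of the multiplier
        x = s + a_seg * b                        # step 1
        y0 = ((x % r_k) * j0) % r_k              # step 2
        z = x + y0 * n_mod                       # step 3
        s = z // r_k                             # step 4: drop k zero characters
        if s >= n_mod:                           # step 5
            s -= n_mod
    return s


n_mod, a, b = 0xA59, 0x99B, 0x5C3
h = pow(2, 2 * 12, n_mod)                        # H = 2^(2n) mod N
c = interleaved_montgomery(a, b, n_mod, k=4, m=3)
f = interleaved_montgomery(c, h, n_mod, k=4, m=3)
print(hex(h), hex(c), hex(f))                    # 0x44b 0x91 0x220
```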

If at each step we disregard k LS zeros, we are in essence multiplying the n MS
characters by r^k. Likewise, at each step, the i'th segment of the multiplier is also
a number multiplied by r^{ik}, giving it the same rank as

It can also be noted that in another preferred embodiment, wherein it is of 
some potential value to know the J 0 constant, 
Exponentiation 

The following derivation of a sequence [D. Knuth, The art of computer 
programming, vol. 2: Seminumerical algorithms, Addison-Wesley, Reading 
Mass., 1981] hereinafter referred to as "Knuth", explains a sequence of squares 
and multiplies, which implements a modular exponentiation. 

After precomputing the Montgomery constant, H = 2^{2n}, as this device can
both square and multiply in the P field, we compute:

C = A^E mod N.

Let E(j) denote the j'th bit in the binary representation of the exponent E, starting
with the MS bit whose index is 1 and concluding with the LS bit whose index is
q; we can exponentiate as follows for odd exponents:

A* ¥ P(A·H)N              A* is now equal to A·2^n.
B = A*
FOR j = 2 TO q-1
    B ¥ P(B·B)N
    IF E(j) = 1 THEN
        B ¥ P(B·A*)N
ENDFOR
B ¥ P(B·B)N               the squaring for j = q
B ¥ P(B·A)N               E(q) = 1; B is the last desired temporary
                          result multiplied by 2^n,
                          A is the original A.
C = B
C = C - N if C ≥ N.

After the last iteration, the value B is ¥ to A^E mod N, and C is the final value.

To clarify, we shall use the following example:

E = 1011 → E(1) = 1; E(2) = 0; E(3) = 1; E(4) = 1;

To find A^{1011} mod N; q = 4

A* = P(A·H)N = A·I^{-2}·I = A·I^{-1} mod N
B = A*

FOR j = 2 to q

j = 2    B = P(B·B)N which produces: A^2·(I^{-1})^2·I = A^2·I^{-1}
         E(2) = 0; B = A^2·I^{-1}

j = 3    B = P(B·B)N = (A^2·I^{-1})^2·I = A^4·I^{-1}
         E(3) = 1; B = P(B·A*)N = (A^4·I^{-1})(A·I^{-1})·I = A^5·I^{-1}

j = 4    B = P(B·B)N = A^{10}·I^{-2}·I = A^{10}·I^{-1}

As E(4) was odd, the last multiplication will be by A, to remove the parasitic I^{-1}:

         B = P(B·A)N = A^{10}·I^{-1}·A·I = A^{11}
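
The sequence of squares and multiplies can be sketched as follows for odd exponents. This is an illustrative model only: mont_mul computes P(A·B)N directly with a modular inverse (Python 3.8 or later) instead of the serial hardware, and both function names are hypothetical.

```python
def mont_mul(a: int, b: int, n_mod: int, n_bits: int) -> int:
    """Reference model of P(A*B)N = A*B*2^-n mod N (odd N)."""
    inv_r = pow(1 << n_bits, -1, n_mod)          # I = 2^-n mod N
    return (a * b * inv_r) % n_mod


def mont_exp(a: int, e: int, n_mod: int, n_bits: int) -> int:
    """A^E mod N for odd E, using only P field squarings and multiplications."""
    if e < 1 or e % 2 == 0:
        raise ValueError("this sequence assumes a positive odd exponent")
    if e == 1:
        return a % n_mod
    h = pow(2, 2 * n_bits, n_mod)                # H = 2^(2n) mod N
    a_star = mont_mul(a, h, n_mod, n_bits)       # A* = A * 2^n mod N
    b = a_star
    bits = bin(e)[2:]                            # bits[0] is E(1), the MS bit
    for bit in bits[1:-1]:                       # j = 2 .. q-1
        b = mont_mul(b, b, n_mod, n_bits)
        if bit == "1":
            b = mont_mul(b, a_star, n_mod, n_bits)
    b = mont_mul(b, b, n_mod, n_bits)            # j = q: square,
    return mont_mul(b, a, n_mod, n_bits)         # then multiply by the original A


print(mont_exp(7, 0b1011, 13, 4) == pow(7, 0b1011, 13))   # True
```

Multiplying by the original A in the last step, rather than by A*, removes the parasitic factor without a separate retrieval multiplication.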

A method for computing the H parameter by a reciprocal process is described 
in US Patent 5,513,133. 



Reference is now made to Fig. 3, which is a simplified block diagram showing 
how the present invention may be implemented in smart cards and other security 
devices. An internal bus, 500, links components including a CPU, 502, a RAM, 
504, non-volatile memory, 506, controlled access EEPROM, 508, and modular 
arithmetic coprocessor, 510. As shown herein, the coprocessor, 510, is 
connected via data, 512, and control, 514, registers to the internal bus, 500. The 
controlled access ROM, 508, is connected via address and data latch means, 516, 
and a control and test register, 518. Various other devices may be attached to the 
bus such as a physical sequence random generator, 520, security logic, 522, 
smart card and external port interfacing circuitry, 524, and 526, respectively. 

When a cryptographic program, such as verifying an RSA signature is 
executed, it may require modular arithmetic functions such as modular 
exponentiation. The cryptographic program that calls the cryptographic function 
is preferably run on the CPU, 502. 

Reference is now made to Fig. 4, which is another simplified block diagram of 
an implementation of the present invention for use in a smart card. Parts that are 
the same as those shown in Fig. 3 are given the same reference numerals and are 
not described again, except as necessary for an understanding of the present 
embodiment. In Fig. 4 the CPU 502 is shown with an external accumulator 
7350. Data Disable Switch, 7340, detaches the CPU Accumulator from the Data 
Bus 500, while unloading data from the arithmetic coprocessor enables direct 
transfer of data from the SMAP to memory. 

Fig. 5 is a simplified block diagram of a preferred embodiment of a data
register bank, 6205, within a coprocessor 6075, as depicted in coprocessors of
Figs. 2, 6 and 7, with a J_0 generator, wherein the J_0 generator typically compiles
an l bit primary zero forcing function.

The coprocessor 6075 is connected to a data bus with a CPU as in previous
figures. A register bank, 6205, comprises a B register 6070, an A register 6130,
an S register 6180, and an N register 6200. The outputs of each of the registers
are connected to a serial data switch and serial process conditioner 6020, which
in turn is connected to an operational unit, 6206, which carries out the modular
arithmetic operations. Connected between the N register, 6200, and the
operational unit, 6206, is a J_0 generator, 552.

In the embodiment the J_0 generator compiles an l bit primary zero forcing
function for use in the modular arithmetic functions described above.

Fig. 6 is a simplified internal block diagram of the operational unit of Fig. 5.
The unit preferably supports accelerated squaring operations, in that the
additional Y_0B_0 serial buffer accepts Y_0 in the first phase, and in the second phase
a modular reduced B_0 for a subsequent squaring operation, wherein it is found
that B is larger than N.

Reference is now made to Fig. 7A, which is a block diagram of the main 
computational part of the operational unit of Fig. 6. Numbers appearing in 
circles relate to the sequence diagrams of Figs. 7B and 7D. 

Reference is now made to Fig. 7B, which is an event timer pointer diagram 
showing progressively the process leading to and including the first iteration of a 
squaring operation. 

Reference is now made to Fig. 7C which is a generalized event sequence 
showing a method of eliminating, the Next Montgomery Squaring delays in a 
first iteration of a squaring sequence. Circled numbers refer to Figs. 7A, 7B 
and 7D. 

Reference is now made to Fig. 7D which is a generalized event timer pointer 
diagram illustrating the timing of the computational output of the first iteration 
of a squaring operation. 

Reference is now made to Fig. 8A, a set of look up tables, which typically show
the choices of J_0, which is the negative of the multiplicative inverse over
modulus 2^l of the right hand character of N, N_0. As N_0 is always either monic for
GF(2^q) or odd for GF(p), J_0 always exists.

In Figs. 8A and 8B, we refer to this right hand character of the modulus as N_0.
We refer to N_{0j} as the j'th bit of the locally defined N_0 character.
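
A look up table of this kind can be generated, for the GF(p) case only, with the sketch below. The function name j0_table is hypothetical, the tables of Fig. 8A themselves are not reproduced here, and the polynomial based GF(2^q) case, which uses carry-free arithmetic, is not covered. The modular inverse via pow assumes Python 3.8 or later.

```python
def j0_table(l: int) -> dict:
    """Map every odd l bit character N_0 to J_0 = -N_0^-1 mod 2^l (GF(p) case)."""
    modulus = 1 << l
    return {n0: (-pow(n0, -1, modulus)) % modulus
            for n0 in range(1, modulus, 2)}


for l in (2, 4):
    table = j0_table(l)
    print(l, {format(n0, "b"): format(j0, "b") for n0, j0 in table.items()})
```

For l = 4 and N_0 = 9 the table gives J_0 = 7, the value used in the hexadecimal example of the interleaved method.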



Fig. 8B is a schematic for designing either a 4 bit or a 2 bit Y0 zero forcing
function character. The variable inputs into the force function are the N0 bits
(constant throughout a multiplication), the l S0 bits, the l right hand bits of
the product of the l multiplier and multiplicand bits, Ai0 and B0j, and the carry
switch, y, which determines whether functions work in GF(2^q) or GF(p). The A
and B bits are input into a ⊗ multiplier and ⊕ added to the S0. When y = 0, all
carries are disabled.
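
As a hedged arithmetic sketch of the relationship described above, and not of the gate level schematic of Fig. 8B, the following Python example forms a zero forcing character as Y0 = J0 ⊗ (S0 ⊕ A0 ⊗ B0) mod 2^l, with the carry switch y selecting ordinary integer arithmetic (y = 1) or carry-less arithmetic (y = 0). The function names and the sample operand values are assumptions made for illustration only.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less (polynomial over GF(2)) product of two bit strings."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def y0_char(s0: int, a0: int, b0: int, j0: int, l: int, y: int) -> int:
    """Return the l bit zero forcing character for one step.

    y = 1: carries enabled, GF(p) style integer arithmetic.
    y = 0: all carries disabled, GF(2^q) style XOR arithmetic.
    """
    mask = (1 << l) - 1
    if y:
        return (j0 * (s0 + a0 * b0)) & mask
    return clmul(j0, s0 ^ clmul(a0, b0)) & mask


# Sample values (illustrative): l = 4, N0 = 1011 binary.
L_BITS, N0, S0, A0, B0 = 4, 0b1011, 5, 7, 9

# GF(p): J0 = 13 is the negative inverse of 11 modulo 16.
y0_p = y0_char(S0, A0, B0, 13, L_BITS, y=1)
low_bits_p = (S0 + A0 * B0 + y0_p * N0) & 0xF

# GF(2^q): J0 = 0111 is the carry-less inverse of 1011 modulo x^4.
y0_2 = y0_char(S0, A0, B0, 0b0111, L_BITS, y=0)
low_bits_2 = (S0 ^ clmul(A0, B0) ^ clmul(y0_2, N0)) & 0xF
```

In both cases the l right hand bits of S0 ⊕ A0 ⊗ B0 ⊕ Y0 ⊗ N0 come out as zero, which is the property the zero forcing character is meant to guarantee.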

It is appreciated that various features of the invention, which are, for clarity, 
described in the contexts of separate embodiments, may also be provided in 
combination in a single embodiment. Conversely, various features of the 
invention, which are, for brevity, described in the context of a single 
embodiment, may also be provided separately or in any suitable subcombination. 

It will be appreciated by persons skilled in the art, that the present invention is 
not limited to what has been particularly shown and described hereinabove. 
Rather, the scope of the present invention includes both combinations and 
subcombinations of the various features described hereinabove as well as 
variations and modifications thereof, which would occur to persons skilled in the
art upon reading the foregoing description and which are not in the prior art.

In the following claims, symbols such as ⊕ and ⊗ have the meanings given in the
preceding description.

CLAIMS

1. A microelectronic apparatus for performing ⊗ multiplication and squaring
in both polynomial based GF(2^q) and GF(p) field arithmetic, squaring and
reduction using a serial fed radix 2^l multiplier, B, with k character multiplicand
segments, Ai, and a k character ⊕ accumulator wherein reduction to a limited
congruence is performed "on the fly" in a systolic manner, with Ai, a
multiplicand, times B, a multiplier, over a modulus, N, and a result being at most
2k + 1 characters long, including the k first emitting disregarded zero characters,
which are not saved, where k characters have no less bits than the modulus, the
apparatus comprising:

a first (B), and second (N) main memory register means, each register operative to
hold at least n bit long operands, respectively operative to store a multiplier value
designated B, and a modulus, denoted N, wherein the modulus is smaller than 2^n;

a digital logic sensing detector, Y0, operative to anticipate "on the fly" when a
modulus value is to be ⊕ added to the value in the ⊕ adder accumulator device such
that all first k characters emitting from the device are forced to zero;

a modular multiplying device for at least k character input multiplicands, with
only one, at least k characters long ⊕ adder, ⊕ summation device operative to
accept k character multiplicands, the ⊗ multiplication device operative to switch
into the ⊕ accumulator device, in turn, multiplicand values, and in turn to
receive multiplier values from a B register, and an "on the fly" simultaneously
generated anticipated value as a multiplier which is operative to force k first
emitting zero output characters in the first phase, wherein at each effective
machine cycle at least one designated multiplicand is ⊕ added into the ⊕
accumulation device;

the multiplicand values to be switched in turn into the ⊕ accumulation device
consisting of one or two of the following three multiplicands, the first
multiplicand being an all-zero string value, a second value, being the
multiplicand Ai, and a third value, the N0 segment of the modulus;

an apparatus to anticipate the l bit k character serial input Y0 multiplier values;

the multiplier values which are input in turn into the multiplying device in the
first phase being first the B operand, and concurrently, the second multiplier
value consisting of the Y0 "on the fly" anticipated k character string, to force
first emitting zeroes in the output;

an ⊕ accumulation device, operative to output values simultaneously as
multiplicands are ⊕ added into the ⊕ accumulation device;

an output transfer mechanism, in the second phase operative to output a final
modular ⊗ multiplication result from the ⊕ accumulation device.

2. An apparatus as in claim 1 wherein ⊕ summations into the ⊕ accumulation
device are activated by each new serially loaded higher order multiplier
character.

3. An apparatus as in claims 1 and 2, wherein the multiplier characters:

are operative to cause no ⊕ summation into the ⊕ accumulation device if both
the input B character and the corresponding input Y0 character are zeroes;

are operative to ⊕ add in only the Ai multiplicand if the input B character is a
one and the corresponding Y0 character is a zero;

are operative to ⊕ add in only the N modulus, if the B character is a zero, and
the corresponding Y0 character is a one; and

are operative to ⊕ add in the ⊕ summation of the modulus, N, with the
multiplicand Ai if both the B input character and the corresponding Y0 character
are ones.
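
The four cases recited in claim 3 may be pictured, purely as an illustrative software model and under the assumption of single bit characters (l = 1), as a four way selection of the value to be ⊕ added into the accumulation device. The function name `select_multiplicand` and its `carries_enabled` parameter are hypothetical and do not appear in the described apparatus.

```python
def select_multiplicand(b_char: int, y0_char: int, a_i: int, n: int,
                        carries_enabled: bool = True) -> int:
    """Pick the value to add into the accumulator for one multiplier step.

    b_char and y0_char are single bit multiplier characters.
    carries_enabled=True models GF(p); False models GF(2^q), where the
    sum of the two multiplicands degenerates to a bitwise XOR.
    """
    if b_char not in (0, 1) or y0_char not in (0, 1):
        raise ValueError("this sketch assumes single bit characters")
    if b_char and y0_char:
        # Both ones: add the (possibly preloaded) summation of Ai and N.
        return a_i + n if carries_enabled else a_i ^ n
    if b_char:
        return a_i      # only the Ai multiplicand
    if y0_char:
        return n        # only the modulus
    return 0            # both zeroes: nothing is added
```

Holding the sum of Ai and N ready as a third selectable value, rather than adding the two separately, corresponds to the preload arrangement recited in claim 4.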

4. An apparatus as in claim 1, operative to preload multiplicand values Ai and
N, into two designated preload buffers, and to ⊕ summate these values into a
third multiplicand preload buffer, obviating the necessity of ⊕ adding in each
multiplicand value separately.

5. An apparatus as in claim 1, wherein the multiplier values are serial single
character input and the output of the ⊕ accumulation device is serial single
character output, wherein the Y0 detect device is operative to anticipate only one
character in a clocked turn.

6. An apparatus as in claim 1, wherein the ⊕ accumulation device performs
modulo 2 XOR addition/subtraction, wherein all carry bits in addition and
subtraction components are disregarded, thereby precluding provisions for
overflow and further limiting convergence in computations.

7. A ⊗ multiplication apparatus as in claim 1, wherein all carry inputs are
disabled to zero, denoted y = 0, typically operative to perform polynomial based
multiplication.

8. An apparatus as in claim 1 wherein, for a y equal to zero acting on an element
in a circuit equation computing in GF(2^q), the y designates omitted circuitry, and
all adders and subtracters, designated ⊕, have been reduced to XOR, modulo 2
addition/subtraction elements.

9. An apparatus as in claim 1 wherein k first emitting zeroes will egress from
the device controlled by the following four quantities in anticipating the next in
turn Y0 character:

i. the l bit Sout bits of the result of the l bit by l bit mod 2^l ⊗ multiplication
of the right-hand character of the Ai register times the Bd character of the B
stream, A0·Bd mod 2^l;

ii. the first emitting carry out character from the ⊕ accumulation device,
y(CO0);

iii. the l bit Sout character from the second from the right character emitting
cell of the ⊕ accumulation device, SO1;

iv. the l bit J0 value, which is the negative multiplicative inverse of the
right-hand character in the N0 modulus multiplicand register;

wherein values A0·Bd mod 2^l, y(CO0), and SO1 are ⊕ added character to
character together and "on the fly" multiplied by the J0 character to output a
valid Y0 zero-forcing anticipatory character to force an l bit egressing string of
zeroes.

10. An apparatus as in claim 1, wherein ⊗ multiplication on polynomial based
operands is performed in a reverse mode, multiplying from right hand MS
characters to left hand LS characters, operative to perform modular reduced ⊗
multiplication without Montgomery type parasitic functions.

11. An apparatus as in claim 1 where the preload buffers are serially fed and
where multiplicand values are preloaded into the preload buffers on the fly from
a multiplicity of memory devices.

12. An apparatus as in claim 1, wherein a previous value, emitting from an
additional n bit register, S, is ⊕ summated into the output value of the
⊕ accumulation device via an l bit ⊕ adder circuit such that first emitting output
characters are zeroes when the Y0 detector is operative to detect the necessity of
⊕ adding moduli to the ⊕ summation in the ⊕ accumulation device, wherein the
Y0 detector is operative to detect utilizing the next in turn ⊕ added characters
A0·Bd mod 2^l, y(CO0), SO1, Sd and y(COz), the composite of ⊕ added characters
to be finite field ⊗ multiplied on the fly by the l bit J0 value, where ⊕ defines the
addition and ⊗ defines the multiplication as befits the finite field used in the
process.

13. An apparatus as in claim 1, wherein for l = 1, J0 is implicitly 1, and the J0
⊗ multiplication is implicit, without additional hardware.

14. An apparatus as in claim 1 wherein a comparator is operative to sense a
finite field output from the ⊗ modular multiplication device, working in GF(p),
where the first right hand emitting k zero characters are disregarded, where the
output is larger than the modulus, thereby operative to control a modular
reduction whence said value is output from the memory register to which the
output stream from the multiplier device is destined, and thereby precluding
allotting a second memory storage device for the smaller product values.

15. A device as in claim 1 wherein for ⊗ modular multiplication in the GF(2^q),
the apparatus is operative to multiply without an externally precomputed more
than l bit zero-forcing factor.

16. A method according to claim 1 operative to compute a J0 constant by
resetting either the A operand value or the B operand value to zero and setting
the partial result value, S0, to 1.
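
One possible reading of claim 16, offered here only as an assumption, is that with the A·B contribution held at zero and the partial result set to 1, the serial stream of single bit zero forcing characters is itself the negative multiplicative inverse of the right hand character of the modulus. The following Python sketch models that reading bit by bit; the function name `j0_from_y0_stream` and its parameters are hypothetical.

```python
def j0_from_y0_stream(n0: int, l: int, carries_enabled: bool = True) -> int:
    """Collect l single bit zero forcing characters into a J0 value.

    Models a serial machine with A*B forced to zero and S0 = 1.
    carries_enabled=True corresponds to GF(p), False to GF(2^q).
    """
    if n0 & 1 == 0:
        raise ValueError("the right hand bit of N0 must be 1")
    acc = 1            # partial result S0 = 1, product term is zero
    j0 = 0
    for i in range(l):
        y_bit = acc & 1                  # a 1 means the modulus must be added
        j0 |= y_bit << i
        addend = n0 if y_bit else 0
        total = acc + addend if carries_enabled else acc ^ addend
        acc = total >> 1                 # the forced zero bit is shifted out
    return j0
```

For N0 = 1011 (binary) and l = 4 the sketch yields 13 with carries enabled and 0111 with carries disabled, which are the negative inverses of N0 modulo 2^4 in the integer and carry-less senses respectively.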

17. A microelectronic method and apparatus for performing interleaved finite
field ⊗ modular multiplication of integers A and B operative to generate an
output stream of A times B modulus N wherein n, the number of characters in the
modulus operand register, is larger than k, wherein the ⊗ multiplication process
is performed in iterations, wherein at each interleaved iteration with operands
input into a ⊗ multiplying device, consisting of N, the modulus, B, a multiplier,
a previously computed partial result, S, and a k character string segment of A, a
multiplicand, the segments progressing from the A0 string segment to the Am-1
string segment, wherein each iterative result is ⊕ summated into a next in turn S,
temporary result, in turn, wherein first emitting characters of iterative results are
zeroes, the apparatus comprising:

first (B), second (S) and third (N) main memory registers, each register capable of
storing and outputting operands, respectively operative to store a multiplier value, a
partial result value and a modulus, also denoted N;

a modular multiplying device operative to ⊕ summate into the ⊕ accumulation
device, in turn one or two of a plurality of multiplicand values, in turn, during
the phases of the iterative ⊗ multiplication process, and in turn to receive as
multipliers, in turn, inputs from a first value B register, second, from an "on the
fly" anticipating value, Y0, as a multiplier to force first emitting right-hand zero
output characters in each iteration, and third values from the modulus, N,
register;

the multiplicand parallel registers operative at least to receive in turn, values
from the A, B, and N register sources, and in turn, also a multiplicand zero
forcing Y0 value;

a first emitting zero forcing Y0 detect device operative to generate a binary
string operative to be a multiplier during the first phase and operative to be a
multiplicand in the second phase;

multiplicand values to be switched into the accumulation device for the first
phase consisting of a first zero value, a second value, Ai, which is a k character
string segment of a multiplicand, A, and a third value N0, being the first emitting
k characters of the modulus, N;

a temporary result value, S, resulting from a previous iteration, operative to
be summated with the value emanating from the accumulation device, to
generate a partial result for the next in turn iteration;

multiplicand values to be input, in turn, into the accumulation device for the
second phase being, a first zero value, a second Ai operand, remaining in place
from the first phase, and a third Y0 value having been anticipated in the first
phase;

multiplier values input into the multiplying device in the first phase being a
first emitting string, B0, being the first emitting string segment of the B operand,
concurrently multiplying with the second multiplier value consisting of
the anticipated Y0 string which is simultaneously loaded character by character
as it is generated into a preload multiplicand buffer for the second phase;

the two multiplier values input into the apparatus during the second phase
being the left hand n - k character values from the B operand, designated B, and
the left hand n - k characters of the N modulus, designated N, respectively; and

a multiplying flush out device operative in the last phase to transfer the left
hand segment of a result value remaining in the accumulation device into a
result register.
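
The interleaved iteration recited in claim 17 can be modeled at the arithmetic level, for the GF(p) case only and without any of the phase or register timing of the apparatus, by the following Python sketch. It is offered as an assumption-laden illustration: the segment width is expressed in bits (`k_bits`) rather than in l bit characters, and the function name and parameters are hypothetical.

```python
def interleaved_montgomery(a: int, b: int, n: int, k_bits: int, m: int) -> int:
    """Return A * B * 2^(-k_bits * m) mod N, consuming A in m segments.

    Each iteration adds Ai * B to the partial result S, derives the zero
    forcing value Y0 from the right hand segment, adds Y0 * N so that the
    right hand segment becomes all zeroes, and discards those zeroes.
    """
    if n & 1 == 0:
        raise ValueError("N must be odd in GF(p)")
    if not (0 <= a < n and 0 <= b < n):
        raise ValueError("operands must already be reduced modulo N")
    if n >> (k_bits * m):
        raise ValueError("N does not fit in m segments")
    radix = 1 << k_bits
    mask = radix - 1
    j0 = (-pow(n & mask, -1, radix)) & mask   # negative inverse of N0
    s = 0
    for i in range(m):
        a_i = (a >> (i * k_bits)) & mask      # segments A0 .. A(m-1)
        x = s + a_i * b
        y0 = ((x & mask) * j0) & mask         # anticipated zero forcing value
        s = (x + y0 * n) >> k_bits            # right hand segment is zero
    if s >= n:                                # comparator and one subtraction
        s -= n
    return s
```

The result carries the usual Montgomery factor, so multiplying it by 2^(k_bits x m) modulo N recovers A·B mod N. The final comparison and subtraction corresponds loosely to the sensing and reduction recited in claims 14 and 21.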

18. An apparatus as in claim 17, wherein multiplication on polynomial
based operands is performed in a reverse mode, multiplying from MS characters
to LS characters, operative to perform modular reduction without Montgomery
type parasitic functions.

19. An apparatus operative to anticipate the Y0 value using first emitting values
of the multiplicand, and present inputs of the B multiplier, carry out values from
the accumulation device, summation values from the accumulation
device, the present values from the previously computed partial result, and carry
out values from the adder which summates the result from
the accumulation device with the previous partial result.

20. An apparatus as in claim 19 wherein k first emitting zeroes will egress from
the device controlled by the following six quantities in anticipating the next in
turn Y0 character:

i. the l bit Sout bits of the result of the l bit by l bit mod 2^l multiplication of
the right-hand character of the Ai register times the Bd character of the B
stream, A0·Bd mod 2^l;

ii. the first emitting carry out character from the accumulation device,
y(CO0);

iii. the l bit Sout character from the second from the right-hand character
emitting cell of the accumulation device, SO1;

iv. the next in turn character value from the S stream, Sd;

v. the l bit carry out character from the Z output full adder, y(COz);

vi. the l bit J0 value, which is the negative multiplicative inverse of the
right-hand character in the N0 modulus multiplicand register;

wherein values A0·Bd mod 2^l, y(CO0), SO1, Sd are added character to
character together and "on the fly" multiplied by the J0 character to output a
valid Y0 zero-forcing anticipatory character to force an l bit egressing character
string of zeroes.

21. An apparatus as in claim 17 comprised of at least one sensor operative to
compare the output result to N, the modulus, the mechanism operative to actuate
a second subtractor on the output of the result register, thereby to output a
modular reduced value which is limited congruent to the output result value,
precluding the necessity to allot a second memory storage for a smaller result.

22. An apparatus as in claim 17 where a value which is a summation of two 
multiplicands is loaded into a preload character buffer with at least a k characters 
memory means register concurrently whilst one of the values is loaded into a 
preload buffer. 

23. An apparatus with only one accumulation device, and an anticipating 
zero forcing mechanism operative to perform a series of interleaved modular 
multiplications and squarings concurrently performing the equivalent of three 
natural integer multiplication operations, such that a result is an exponentiation. 
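
As a non-authoritative software analogue of the series of multiplications and squarings recited in claim 23, the following Python sketch chains a Montgomery style multiplication into a left to right square and multiply exponentiation for the GF(p) case. Each pass of the inner loop contains three products (Ai·B, the J0 product and Y0·N), which is one plausible reading of "the equivalent of three natural integer multiplication operations". All names are hypothetical, the segment width is given in bits, and the one time set up constant is computed with an ordinary remainder operation purely for convenience.

```python
def mont_mul(a: int, b: int, n: int, k_bits: int, m: int) -> int:
    """Return a * b * 2^(-k_bits * m) mod n for odd n and 0 <= a, b < n."""
    radix = 1 << k_bits
    mask = radix - 1
    j0 = (-pow(n & mask, -1, radix)) & mask
    s = 0
    for i in range(m):
        a_i = (a >> (i * k_bits)) & mask
        x = s + a_i * b                       # first product
        y0 = ((x & mask) * j0) & mask         # second product (zero forcing)
        s = (x + y0 * n) >> k_bits            # third product, zeroes dropped
    return s - n if s >= n else s


def mont_exp(base: int, exponent: int, n: int, k_bits: int, m: int) -> int:
    """Return base^exponent mod n using only mont_mul inside the loop."""
    if n & 1 == 0 or n < 3 or n >> (k_bits * m):
        raise ValueError("n must be odd, at least 3, and fit in m segments")
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    r = 1 << (k_bits * m)
    r2 = (r * r) % n                          # set up constant (assumption)
    x = mont_mul(base % n, r2, n, k_bits, m)  # base in Montgomery form
    acc = mont_mul(1, r2, n, k_bits, m)       # 1 in Montgomery form
    for bit in bin(exponent)[2:]:
        acc = mont_mul(acc, acc, n, k_bits, m)        # squaring
        if bit == "1":
            acc = mont_mul(acc, x, n, k_bits, m)      # multiplication
    return mont_mul(acc, 1, n, k_bits, m)     # strip the Montgomery factor
```

The result can be checked against Python's built-in three argument `pow`, which is how the sketch was validated; it says nothing about the timing or interleaving achieved by the apparatus itself.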

24. An apparatus as in claim 17 where next in turn used multiplicands are 
preloaded into preload register buffer means on the fly. 

25. An apparatus as in claim 17 where a value which is a summation of two 
multiplicands is summated into at least a k character register concurrently 
whilst one of the values is loaded into its preload buffer. 

26. An apparatus as in claim 17 wherein apparatus buffers and registers are 
operative to be loaded with values from external memory sources and said 
buffers and registers are operative to be unloaded into the external memory 
source during computations, such that the maximum size of the operands is 
dependent on available memory means. 

27. An apparatus as in claim 17 wherein memory register means are typically 
serial single character in/serial single character out, parallel at least k 
characters in/parallel at least k characters out, serial single character in/parallel at 
least k characters out, and parallel k characters in/serial single character out.

28. An apparatus as in claim 17 wherein, in the final phase of a multiplication
type iteration, the multiplier inputs are zero characters operative to flush out the
left hand segment of the carry save accumulator memory.

29. An apparatus as in claim 17 where next in turn used multiplicands are 
preloaded into preload memory buffers on the fly. 

30. An apparatus as in claim 17 where multiplicand values are preloaded into
the preload buffers on the fly from central storage memory means.

31. A method according to claim 17 comprising computing J0 = Y0 for l = 1 by
resetting both A and B to zero and setting S0 = 1.

What is claimed is:



CLAIMS

1. A method for performing a modular arithmetic multiplication,

A·B (mod N),

of an integer multiplier, denoted A, with an integer multiplicand, denoted B, modulo
an integer modulus, denoted N, using base r integer arithmetic and without
performing an integer division operation, wherein at least one of A, B and N is a
large integer, comprising:

selecting a segment length, denoted k, wherein a segment is a k-character
integer, and wherein a character is an integer between 0 and r-1;

representing each of A, B and N in base r as a sequence of at least
one segment; and

performing the following operations at most m times:

multiplying two k-character integers in base r,
summing two k-character integers in base r, and
adjusting a k-character base r integer so as to force
its least significant character to be zero.



2. A method according to claim 1 wherein said multiplying two k-character
integers includes multiplying one of the segments of A and one of the
segments of B.

3. A method according to claim 1 wherein said multiplying two k-character
integers includes multiplying a k-character integer by a pre-computed
constant J0 = -N0^-1 (mod r), wherein N0 denotes the least significant segment of N.

4. A method according to claim 1 wherein said summing uses carry
sum addition.

5. A method according to claim 1 wherein said adjusting a k-character
integer operates by adding an integral multiple of N to the k-character integer.

6. A method for performing a modular arithmetic exponentiation,

A^E (mod N),

of an integer, denoted A, raised to an integer power, denoted E, modulo an integer
modulus, denoted N, using base r integer arithmetic and without performing an
integer division operation, wherein at least one of A, E and N is a large integer,
comprising:

selecting a segment length, denoted k, wherein a segment is a k-character
integer, and wherein a character is an integer between 0 and r-1;

representing each of A, E and N in base r as a sequence of at least
one segment; and

multiplying two k-character integers in base r,
summing two k-character integers in base r, and
adjusting a k-character base r integer so as to force its least
significant character to be zero.

7. A method according to claim 6 wherein said adjusting a k-character
integer operates by adding an integral multiple of N to the k-character integer.

8. A method for public key encryption comprising:
performing a modular arithmetic multiplication,

A·B (mod N),

of an integer multiplier, denoted A, with an integer multiplicand, denoted B, modulo
an integer modulus, denoted N, using base r integer arithmetic and without
performing an integer division operation, wherein at least one of A, B and N is a
large integer, comprising:

selecting a segment length, denoted k, wherein a
segment is a k-character integer, and wherein a character is an integer between 0 and
r-1;

representing each of A, B and N in base r as a
sequence of at least one segment; and

performing the following operations at most m times:

multiplying two k-character integers in base r,
summing two k-character integers in base r, and
adjusting a k-character base r
integer so as to force its least significant character to be zero.

9. A method according to claim 8 wherein said adjusting a k-character
integer operates by adding an integral multiple of N to the k-character integer.

10. A microelectronic circuit for performing a modular arithmetic
multiplication,

A·B (mod N),

of an integer multiplier, denoted A, with an integer multiplicand, denoted B, modulo
an integer modulus, denoted N, using base r integer arithmetic and without
performing an integer division operation, wherein at least one of A, B and N is a
large integer, comprising:

memory registers storing representations of each of A, B and N in
base r as a sequence of at least one segment, wherein a segment is a k-character
integer, and wherein a character is an integer between 0 and r-1, and wherein k is a
pre-selected segment length;

a multiplier multiplying two k-character integers in base r;
an accumulator summing two k-character integers in base r; and
a logical unit adjusting a k-character base r integer so as to force its
least significant character to be zero.

11. A microelectronics circuit according to claim 10 wherein said
multiplier multiplies one of the segments of A and one of the segments of B.

12. A microelectronics circuit according to claim 10 wherein said
multiplying two k-character integers includes multiplying a k-character integer by a
pre-computed constant J0 = -N0^-1 (mod r), wherein N0 denotes the least significant
segment of N.

13. A microelectronics circuit according to claim 10 wherein said
accumulator performs carry sum addition.

14. A microelectronics circuit according to claim 10 wherein said logical
unit adjusts a k-character integer by adding an integral multiple of N to the
k-character integer.

15. A microelectronics circuit for performing a modular arithmetic
exponentiation,

A^E (mod N),

of an integer, denoted A, raised to an integer power, denoted E, modulo an integer
modulus, denoted N, using base r integer arithmetic and without performing an
integer division operation, wherein at least one of A, E and N is a large integer,
comprising:

memory registers storing representations of each of A, E and N in
base r as a sequence of at least one segment, wherein a segment is a k-character
integer, and wherein a character is an integer between 0 and r-1, and wherein k is a
pre-selected segment length;

a multiplier multiplying two k-character integers in base r;
an accumulator summing two k-character integers in base r; and
a logical unit adjusting a k-character base r integer so as to force its
least significant character to be zero.

16. A microelectronics circuit according to claim 15 wherein said logical
unit adjusts a k-character integer by adding an integral multiple of N to the
k-character integer.



17. A smart card comprising:

a microelectronics circuit imprinted on the smart card for performing
a modular arithmetic multiplication,

A·B (mod N),

of an integer multiplier, denoted A, with an integer multiplicand, denoted B, modulo
an integer modulus, denoted N, using base r integer arithmetic and without
performing an integer division operation, wherein at least one of A, B and N is a
large integer, comprising:

memory registers storing representations of each of
A, B and N in base r as a sequence of at least one segment, wherein a segment is a
k-character integer, and wherein a character is an integer between 0 and r-1, and
wherein k is a pre-selected segment length;

a multiplier multiplying two k-character integers in base r;
an accumulator summing two k-character integers in base r; and
a logical unit adjusting a k-character base r integer
so as to force its least significant character to be zero.

18. A smart card according to claim 17 wherein said logical unit adjusts
a k-character integer by adding an integral multiple of N to the k-character integer.

19. A public key encryption system comprising:

a processor for performing a modular arithmetic multiplication,

A·B (mod N),

of an integer multiplier, denoted A, with an integer multiplicand, denoted B, modulo
an integer modulus, denoted N, using base r integer arithmetic and without
performing an integer division operation, wherein at least one of A, B and N is a
large integer, comprising:

memory registers storing representations of each of
A, B and N in base r as a sequence of at least one segment, wherein a segment is a
k-character integer, and wherein a character is an integer between 0 and r-1, and
wherein k is a pre-selected segment length;

a multiplier multiplying two k-character integers in base r;
an accumulator summing two k-character integers in base r; and
a logical unit adjusting a k-character base r integer
so as to force its least significant character to be zero.

20. A public key encryption system according to claim 19 wherein said
logical unit adjusts a k-character integer by adding an integral multiple of N to the
k-character integer.



For the Applicant; 



Sanford T. Colb & Co.
C: 40267